Abstract

This  thesis  introduces  some  formal  techniques  which  can  be  used  for  synthesis  of  VLSI 
(very  large  scale  integration)  architectures  for  DSP  (digital  signal  processing)  algorithms. 
These  techniques  can  be  used  to  design  architectures  for  single-rate/single-dimensional 
DSP,  multirate/single-dimensional  DSP,  and  single-rate/multi-dimensional  DSP. 

For single-rate/single-dimensional DSP, we have developed a novel technique for
exhaustively generating all retiming and scheduling solutions for the DSP algorithm. The
significance of this contribution is two-fold. First, it allows a circuit designer to explore
a large space of possible high-level implementations for the algorithm, and thus to make
an informed decision about the high-level architectural details of the design. Second,
this work explicitly shows the important interaction between retiming and scheduling in
high-level synthesis. While retiming and scheduling have been treated as separate problems
in the past, our work uses a mathematical framework to show that retiming is a special
case of scheduling.

Also for single-rate/single-dimensional DSP, we have developed techniques for computing
the minimum number of registers required to implement a statically scheduled DSP program.
Closed-form expressions are derived for computing the minimum number of registers
assuming various memory models with or without retiming the scheduled DFG. This is an
important problem because memory typically occupies a large portion of the area of a DSP
implementation (often over half of the area), and minimizing this area leads to more
efficient designs.

For multirate/single-dimensional DSP, we have developed a multirate folding technique
which can be used to synthesize single-rate architectures from multirate DSP algorithms.
Prior to the development of this formal technique, the design of single-rate architectures
for multirate DSP algorithms was performed using ad hoc design techniques.

For single-rate/multi-dimensional DSP, we have developed two techniques for retiming
two-dimensional data-flow graphs. These techniques are designed to minimize the memory
requirements under a given clock period constraint. These techniques can result in retimed
circuits which use less than 50% of the memory required by previously used techniques.

Chapter 1


Introduction 

1.1  Overview 

This  thesis  introduces  some  formal  techniques  which  can  be  used  for  the  synthesis  of 
VLSI  [1,  2]  (very  large  scale  integration)  architectures  for  DSP  [3,  4,  5,  6]  (digital  signal 
processing)  algorithms.  DSP  is  used  in  many  applications  such  as  compact  disc  players, 
digital  television,  videoconferencing  systems,  digital  telephony,  radar,  and  sonar,  just 
to  name  a  few.  VLSI  architectures  for  DSP  algorithms  must  be  designed  to  satisfy 
constraints  on  the  sampling  rate,  chip  size,  and  power  consumption.  Without  adequate 
implementations,  DSP  algorithms  would  not  be  useful  to  consumers. 

Figure 1.1 shows a simplified version of the process of generating a silicon solution
for a given application. There are three main steps in this process. The first step is
to develop or choose the proper DSP algorithm for the application. The second step
is high-level synthesis [7]-[26], which maps the algorithm to a VLSI architecture, and
the third step is low-level synthesis, which maps the VLSI architecture to silicon. These
three steps are not independent, and it has become apparent that a good understanding
of all three of these steps is required to design an efficient silicon solution for a given
application. The focus of this thesis, as indicated in the figure, is on the area of
high-level synthesis, i.e., designing high-level VLSI architectures for DSP algorithms. The
formal techniques introduced in this thesis help provide a better understanding of the
algorithm → architecture step and provide new techniques for mapping algorithms to
architectures.

Figure 1.1: A simplified version of the design process from application to silicon.
(Figure not reproduced. Recoverable labels: "General Procedure", "Example", "This Thesis",
"Single-phase clock", "Wallace Tree mult.", "Carry-select adder".)


As  DSP  algorithms  become  more  complex  and  transistor  sizes  become  smaller,  the 
tasks  of  designing  and  testing  VLSI  architectures  for  DSP  have  become  very  challenging 
due  to  the  sheer  size  of  these  tasks.  In  order  for  products  to  be  introduced  in  a  timely 
manner,  CAD  (computer-aided  design)  tools  [8,  26,  24,  16,  10,  12,  20,  22,  23,  14,  15]  are 
often  required.  These  tools  not  only  decrease  design  time,  but  they  also  make  the  design 
process  more  tractable,  improving  the  reliability  of  the  final  VLSI  design.  These  CAD 
tools  are  based  on  formal  design  techniques  which  can  be  used  to  automate  the  process 
of  synthesizing  VLSI  architectures  for  DSP  algorithms. 

Some formal techniques for synthesizing VLSI architectures for DSP algorithms are
introduced  in  this  thesis.  These  techniques  can  be  used  to  explore  new  VLSI  designs  for 
DSP  algorithms  and  improve  CAD  tools  which  are  used  to  design  VLSI  architectures  for 
DSP  algorithms.  A  description  of  these  techniques  is  given  in  the  following  section. 

1.2  Contributions 

The  contributions  of  this  thesis  fall  into  the  categories  of  retiming  [27],  folding  [28],  and 
register  minimization  [29].  A  concise  description  of  these  contributions  follows. 

•  Retiming 

-  Exhaustive retiming: A novel technique for exhaustively generating all retiming
solutions for a DFG is developed. This technique, which is based on
the ideas in [30], [31], allows a circuit designer to examine many retiming
solutions rather than a single solution which is generated using a heuristic or
an optimization scheme. This is useful because the designer can then select,
from all retiming solutions, the one that is best with respect to circuit
parameters such as routing area.

-  Two-dimensional retiming: Two novel techniques are developed for retiming
two-dimensional data-flow graphs (DFGs) to minimize the memory requirements
under a given clock period constraint. These two techniques are integer
linear programming (ILP) 2-D retiming and orthogonal 2-D retiming [32].
These techniques offer greater flexibility than the technique proposed in [33],
and they can reduce the memory requirement of retimed circuits by over 50%
compared to the technique in [34].

-  Multirate retiming: Multirate retiming constraints are formalized as part of
the multirate folding formulation. Multirate retiming has received little
attention in the past, and most of the previous work has been focused on
maintaining properties such as liveness and reachability in synchronous data-flow
graphs (e.g., see [35]). The treatment of multirate retiming in this thesis
considers the problem at a more fundamental level by using some simple identities
of multirate DSP [5]. We show that our multirate retiming formulation is
useful for high-level synthesis of single-rate VLSI architectures for multirate DSP
algorithms [36].

•  Folding 

-  Exhaustive scheduling: A novel technique for exhaustively generating all time
schedules for folding a DFG is developed [31]. This technique, termed
"exhaustive scheduling", has three important features. First, it shows the
important interaction between retiming and scheduling in a solid mathematical
framework. Retiming and scheduling have only recently been considered
together [11, 26, 12, 37, 38], and none of these works has given a mathematical
framework for demonstrating how retiming and scheduling interact in
high-level synthesis. Second, our mathematical framework can be used to show
that retiming is simply a special case of scheduling. Many researchers have
thought this to be true for a long time, but none have shown this
mathematically. Finally, exhaustive scheduling allows a circuit designer the option
of evaluating several different schedules for characteristics that are difficult
to include in heuristics [12, 15, 26] or ILP models [39, 40, 22, 37] used for
scheduling.

-  Multirate folding: A novel technique for folding multirate DSP algorithms is
developed [36]. This technique maps multirate DSP algorithms to single-rate
VLSI architectures. For example, multirate folding can be used to design
single-rate architectures for algorithms which use multirate filter banks, such
as the discrete wavelet transform (DWT) [41, 42, 43, 44, 45]. Prior to the
development of multirate folding, single-rate VLSI architectures for multirate
DSP algorithms were designed using ad hoc design techniques. Multirate
folding provides a vehicle for systematically designing improved architectures
for multirate DSP algorithms.

•  Register  Minimization 

-  Single-rate register minimization: Expressions are derived for computing the
minimum number of registers required to implement a statically scheduled
single-rate DSP algorithm [46]. To the best of our knowledge, no such
expressions existed prior to this work. Expressions are derived for three different
memory models. These expressions can be used in CAD tools to evaluate
the quality of schedules with respect to memory requirements. For example,
these expressions are used along with our exhaustive scheduling technique to
determine the schedules which require the minimum number of registers.

-  Multirate register minimization: Expressions are derived for computing the
minimum number of registers required to implement a statically scheduled
multirate DSP algorithm. This novel approach to evaluating memory
requirements allows for the design of memory-efficient single-rate architectures
for the implementation of multirate DSP algorithms.

1.3  Outline

This thesis is organized as follows. The exhaustive retiming and scheduling algorithms
are developed in Chapter 2. This chapter also provides background information on
retiming and folding. Register minimization for statically scheduled single-rate data-flow
graphs is considered in Chapter 3. Chapter 4 contains the derivation of the multirate
folding transformation, including the work on retiming for multirate folding and register
minimization for folded multirate DSP algorithms. The two-dimensional retiming
techniques are derived in Chapter 5, and conclusions and suggestions for future research are
presented in Chapter 6.

Chapter 2


Exhaustive Retiming and Scheduling

2.1  Introduction 

Time scheduling and retiming [27] are important tools used to map behavioral descriptions
of algorithms to physical realizations. These tools are used during the design of
software for programmable digital signal processors (DSPs), during high-level synthesis
of application-specific integrated circuits (ASICs), and during the design of reconfigurable
hardware such as field-programmable gate arrays (FPGAs). Time scheduling and
retiming operate directly on a behavioral description of the algorithm, such as a data-flow
graph (DFG). Since the decisions made at the algorithmic level tend to have greater
impact on the design than those made at lower levels, the importance of time scheduling
and retiming cannot be overstated.

This chapter presents new formulations of the time scheduling and retiming problems,
and based on these formulations, new techniques are developed to determine the solutions
to these problems [31]. (From this point forward, we shall refer to time scheduling
as simply scheduling.) These formulations are valid for strongly connected (SC) graphs,
where a strongly connected graph has a path $u \leadsto v$ and a path $v \leadsto u$ for every
pair of nodes $u, v$ in the graph. We focus on strongly connected graphs because these graphs
traditionally present the greatest challenges when they are mapped to physical realizations
due to the feedback present in the graphs. An example of a strongly connected
DFG is the fifth-order wave digital elliptic filter [47] in Figure 2.18, which is commonly
used as a benchmark for demonstrating high-level synthesis techniques.

Scheduling consists of assigning execution times to the operations in a DFG such
that the precedence constraints of the DFG are not violated. A great deal of literature
exists on the topic of scheduling in the context of high-level synthesis for ASIC
design for DSP applications [7]-[26]; however, none of these works gives a formal
definition of scheduling along with systematic techniques for exhaustively generating the
solutions to the scheduling problem. This chapter presents new scheduling formulations
and algorithms for exhaustively generating the solutions to the scheduling problem. Two
scheduling problems are considered, namely, scheduling for time-multiplexed execution
on bit-parallel architectures and scheduling for execution on bit-serial architectures.

Retiming consists of moving delays around in a DFG without changing its functionality.
As with scheduling, there is a huge body of literature on retiming, and new
applications for retiming are constantly being found. For example, due to the recent
demand for low-power digital circuits in portable devices, some recent work has focused
on retiming for power minimization [48]. The groundbreaking paper on retiming [27]
describes algorithms for tasks such as retiming to minimize the clock period and retiming
to minimize the number of registers (states) in the retimed circuit. An approach to
retiming which is based on circuit theory can be used to generate all retiming solutions
for a DFG [30]. This approach was the motivation for our work on exhaustive scheduling.
In this chapter, we show that retiming is a special case of scheduling, and consequently,
the formulation of the scheduling problem and the techniques for exhaustively generating
the scheduling solutions can also be applied to retiming.
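
To make "moving delays around" concrete, the following minimal Python sketch uses the
relation commonly found in the retiming literature, in which each node $v$ is assigned an
integer $r(v)$ and an edge $u \to v$ with $w(e)$ delays ends up with
$w_r(e) = w(e) + r(v) - r(u)$ delays. This is offered as an illustration under that
assumption and is not necessarily the formulation developed in this chapter; the
three-node loop, its delay counts, the retiming values, and the function name `retime`
are all hypothetical.

```python
def retime(edges, r):
    """Apply retiming values r to a list of (u, v, delays) edges.

    Assumes the common relation w_r(e) = w(e) + r(v) - r(u).
    Raises ValueError if any edge would end up with a negative delay count.
    """
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would have {w_r} delays")
        retimed.append((u, v, w_r))
    return retimed


# Hypothetical three-node loop a -> b -> c -> a with 2 delays on c -> a.
edges = [("a", "b", 0), ("b", "c", 0), ("c", "a", 2)]
r = {"a": 0, "b": 1, "c": 1}  # illustrative retiming values

retimed = retime(edges, r)

# The number of delays around the loop is the same before and after.
loop_delays_before = sum(w for _, _, w in edges)
loop_delays_after = sum(w for _, _, w in retimed)
```

In this example one delay moves from edge c -> a to edge a -> b, while the total number
of delays in the loop stays at two.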

The impact of the formulations derived in this chapter is as follows.

•  The  interaction  between  retiming  and  scheduling  is  important  [11],  and  our  formu¬ 
lations  give  a  simple  way  to  observe  this  interaction. 

•  We  show  that  retiming  is  a  special  case  of  scheduling. 

•  We  give  solid  mathematical  descriptions  of  the  scheduling  and  retiming  problems 
in  a  common  framework. 

•  We develop techniques for generating all solutions to a particular scheduling or
retiming problem. This allows a developer to search the design space
for the best solution, particularly when various parameters are difficult to model
and include in a cost function. This has applications to software design, ASIC
design, and design for reconfigurable hardware implementations.

•  Our formulations provide for a better understanding of scheduling and retiming
which can be used to develop new heuristics for these problems.

Many of the results in this chapter rely upon graph theory. Section 2.2 gives a review
of some results from graph theory along with the derivation of an algorithm for finding
the independent loops in a strongly connected directed graph. Our formulations for
scheduling to bit-parallel and bit-serial architectures are given in Section 2.3 along with
an explanation of how retiming can be viewed as a special case of scheduling. Section 2.4
contains the description of a systematic technique used to exhaustively generate the
scheduling and retiming solutions. Section 2.5 describes two techniques for exhaustively
generating the schedules which satisfy a given set of resource constraints for a bit-parallel
architecture. Section 2.5 includes the results of scheduling the fifth-order wave digital
elliptic filter in Figure 2.18 with and without resource constraints. Our conclusions are
given in Section 2.6.

2.2  Introduction  to  Graph  Theory 

This  section  provides  a  brief  introduction  to  graph  theory  followed  by  an  algorithm 
for  finding  the  independent  loops  in  a  strongly  connected  directed  graph.  Most  of  the 
definitions  and  results  in  Sections  2.2.1  and  2.2.2  can  be  found  in  [49]. 

2.2.1  Basic  Definitions 

We are concerned only with directed graphs. A directed graph $G$ is represented as
$G = \langle V, E, d, w \rangle$, where

•  $V$ is the set of vertices (nodes) of $G$. The vertices represent computations.

•  $E$ is the set of directed edges of $G$. A directed edge $e \in E$ from node $u \in V$ to
node $v \in V$ is denoted as $u \xrightarrow{e} v$. The edges represent communication between the
nodes.

•  $w(e)$ is the number of delays on the edge $e$, also referred to as the weight of the
edge.

•  $d(v)$ is the computation time of the node $v$.

A directed path $v_0 \to v_1 \to \cdots \to v_{n-1} \to v_n$ is denoted as $v_0 \leadsto v_n$.
A simple path is a path with distinct edges, and an elementary path has distinct nodes. A cycle
is a closed path (i.e., $v_0 = v_n$). A simple cycle has distinct edges and an elementary cycle has
distinct nodes. An elementary cycle in a directed graph will be referred to as a "loop"
in this chapter.

A directed graph is strongly connected if for every pair of vertices $u, v \in V$, there
exist paths $u \leadsto v$ and $v \leadsto u$. A directed spanning tree is a subgraph of $G$ which
has a root node $v_R$ and a path $v_R \leadsto v$ for all $v \in V$ except $v_R$. The directed spanning
tree contains no cycles. If $|V|$ is the number of nodes in $G$, then a directed spanning
tree contains exactly $|V|$ nodes and $|V| - 1$ edges. An edge of a directed spanning tree
is called a branch, and the edges of $G$ not included in the tree are called links. Every
strongly connected graph contains a directed spanning tree.
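
The definition of strong connectivity can be tested directly. The sketch below, in Python,
relies on a standard observation: a directed graph is strongly connected exactly when every
node is reachable from one chosen node and that node is reachable from every other node
(checked by searching the graph with all edges reversed). The function names and the two
small example graphs are illustrative assumptions and do not come from the thesis.

```python
from collections import defaultdict


def reachable(adj, start):
    """Return the set of nodes reachable from start along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_strongly_connected(nodes, edges):
    """True if a directed path exists between every ordered pair of nodes."""
    nodes = list(nodes)
    if not nodes:
        return True
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    root = nodes[0]
    # All nodes reachable from root, and root reachable from all nodes.
    return reachable(fwd, root) == set(nodes) == reachable(rev, root)


# Hypothetical examples: a three-node cycle, and the same graph minus one edge.
cycle = [(1, 2), (2, 3), (3, 1)]
chain = [(1, 2), (2, 3)]
sc_cycle = is_strongly_connected([1, 2, 3], cycle)
sc_chain = is_strongly_connected([1, 2, 3], chain)
```

The forward search from the root also visits the nodes in an order from which a directed
spanning tree rooted at that node could be recorded, although the sketch does not do so.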

An edge $e$ from $u$ to $v$ ($u \xrightarrow{e} v$) is incident with vertices $u$ and $v$. More
specifically, $e$ is incident from $u$ and incident into $v$.

The set operations such as union, intersection, difference, complement, etc., are
operations on the edges of a graph. Let $G_a$ and $G_b$ be two subgraphs of a connected graph
$G$. $G_a \cup G_b$ consists of all edges in $G_a$ or $G_b$ (or both) and the vertices incident with
these edges. $G - G_a$ is formed by removing all edges in $G_a$ from $G$, and then removing
all vertices with no incident edges.


2.2.2  Matrix  Representations 

A strongly connected graph contains exactly $|E| - |V| + 1$ linearly independent loops
(this is shown in Section 2.2.3). Let $B$ be the fundamental loop matrix. This matrix,
which has dimensions $(|E| - |V| + 1) \times |E|$, is defined as

$$ b_{ij} = \begin{cases} 1 & \text{if edge } j \text{ is in loop } i \\ 0 & \text{otherwise.} \end{cases} $$

Each row of $B$ represents one of $|E| - |V| + 1$ linearly independent loops in $G$.


Let $A$ be the oriented incidence matrix of $G$. This matrix, which has dimensions
$|V| \times |E|$, is defined as

$$ a_{ij} = \begin{cases} 1 & e_j \text{ is incident from } v_i \\ -1 & e_j \text{ is incident into } v_i \\ 0 & e_j \text{ and } v_i \text{ are not incident,} \end{cases} $$

and $\operatorname{rank}(A) = |V| - 1$. The reduced oriented incidence matrix $A_R$ is defined to be any
$|V| - 1$ rows of $A$. $A_R$ has dimensions $(|V| - 1) \times |E|$ and $\operatorname{rank}(A_R) = |V| - 1$.

Two important relationships between the fundamental loop matrix and the oriented
incidence matrix are $B A^T = 0$ and $B A_R^T = 0$.


Example 2.1 Consider the directed graph in Figure 2.1. This graph has six nodes and
nine edges ($|V| = 6$ and $|E| = 9$). The branches of a directed spanning tree are shown
with solid lines and the links are shown with dashed lines. The spanning tree contains
$|V| - 1$ edges and $|V|$ nodes. One possibility for the $((|E| - |V| + 1) \times |E|) = (4 \times 9)$ $B$
matrix is

$$ B = \begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1
\end{bmatrix}, $$

whose columns and rows appear according to the numbering of the edges and loops,
respectively, in Figure 2.1. $A$ is the $(|V| \times |E|) = (6 \times 9)$ matrix

$$ A = \begin{bmatrix}
 1 &  0 &  0 &  0 &  0 & -1 &  0 &  0 &  0 \\
 0 & -1 &  1 &  0 &  0 &  1 &  0 & -1 &  0 \\
-1 &  1 &  0 &  1 &  0 &  0 & -1 &  0 &  0 \\
 0 &  0 &  0 &  0 & -1 &  0 &  0 &  1 &  0 \\
 0 &  0 & -1 &  0 &  1 &  0 &  1 &  0 & -1 \\
 0 &  0 &  0 & -1 &  0 &  0 &  0 &  0 &  1
\end{bmatrix}. $$

The reader can verify that $\operatorname{rank}(A) = |V| - 1 = 5$ and $B A^T = 0_{4 \times 6}$. One possible reduced
incidence matrix is the $((|V| - 1) \times |E|) = (5 \times 9)$ matrix

$$ A_R = \begin{bmatrix}
 0 & -1 &  1 &  0 &  0 &  1 &  0 & -1 &  0 \\
-1 &  1 &  0 &  1 &  0 &  0 & -1 &  0 &  0 \\
 0 &  0 &  0 &  0 & -1 &  0 &  0 &  1 &  0 \\
 0 &  0 & -1 &  0 &  1 &  0 &  1 &  0 & -1 \\
 0 &  0 &  0 & -1 &  0 &  0 &  0 &  0 &  1
\end{bmatrix}, \tag{2.2} $$

which is simply $A$ with the first row (the row corresponding to node 1) removed. The
reader can verify that $\operatorname{rank}(A_R) = |V| - 1 = 5$ and $B A_R^T = 0_{4 \times 5}$.
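
The relationships stated in Example 2.1 can also be checked mechanically. The following
sketch, in plain Python, rebuilds $A$ from an edge list read off its columns (edge $j$ runs
from the node with entry 1 to the node with entry -1), computes ranks by exact Gaussian
elimination, and forms $B A^T$ and $B A_R^T$. The helper names `incidence_matrix`, `rank`, and
`times_transpose` are illustrative and are not taken from the thesis.

```python
from fractions import Fraction

# Edges of the Figure 2.1 graph as (tail, head), read off the columns of A.
EDGES = [(1, 3), (3, 2), (2, 5), (3, 6), (5, 4), (2, 1), (5, 3), (4, 2), (6, 5)]
NUM_NODES = 6

# Fundamental loop matrix B from Example 2.1 (rows = loops, columns = edges).
B = [
    [1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 1],
]


def incidence_matrix(num_nodes, edges):
    """Oriented incidence matrix: +1 in the tail row, -1 in the head row."""
    a = [[0] * len(edges) for _ in range(num_nodes)]
    for j, (tail, head) in enumerate(edges):
        a[tail - 1][j] = 1
        a[head - 1][j] = -1
    return a


def rank(matrix):
    """Rank by Gaussian elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    found = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(found, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for i in range(len(m)):
            if i != found and m[i][col] != 0:
                factor = m[i][col] / m[found][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[found])]
        found += 1
        if found == len(m):
            break
    return found


def times_transpose(b, a):
    """Return the matrix product b * a^T as a list of rows."""
    return [[sum(x * y for x, y in zip(b_row, a_row)) for a_row in a]
            for b_row in b]


A = incidence_matrix(NUM_NODES, EDGES)
A_R = A[1:]  # drop the row for node 1, as in Example 2.1

rank_A = rank(A)
rank_A_R = rank(A_R)
B_AT = times_transpose(B, A)      # expected to be the 4 x 6 zero matrix
B_ART = times_transpose(B, A_R)   # expected to be the 4 x 5 zero matrix
```

Each entry of `B_AT` sums, for one loop and one node, the edges of the loop that leave the
node (+1) and enter it (-1); in a directed loop these cancel, which is why the product
vanishes.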


Figure 2.1: A strongly connected graph. The branches of a spanning tree are shown with
solid lines, while the links of the corresponding cotree are shown with dashed lines.

2.2.3  Finding  the  Independent  Loops  of  a  Strongly  Connected  Graph 

Recall that the fundamental loop matrix $B$ has $|E| - |V| + 1$ rows, each of which
corresponds to an independent loop. This section gives an algorithm for finding $|E| - |V| + 1$
independent loops of a strongly connected graph. Let $G_T$ be a directed spanning tree of
$G$, where $v_R$ is the root node of $G_T$, i.e., there is a path $v_R \leadsto v$ for all $v \in V$ except $v_R$.

Algorithm FFL (Find Fundamental Loops) is given below.

Algorithm FFL (Find Fundamental Loops)

FOR ($k = 1$ TO $|E| - |V| + 1$)

{

STEP 1:  $l_k$ = a link in $(G - G_N^{(k)})$ which is incident into $G_N^{(k)}$;

STEP 2:  loop($k$) = a loop in $G_T \cup G_N^{(k)} \cup l_k$ which contains $l_k$;

STEP 3:  $G_N^{(k+1)} = G_N^{(k)} \cup$ loop($k$);

}

The $|E| - |V| + 1$ loops denoted as loop($k$), $1 \le k \le (|E| - |V| + 1)$, are the fundamental
loops of $G$.


Algorithm FFL maintains a subgraph $G_N$ which initially consists of the root node
of the directed spanning tree $G_T$. During iteration $k$, a link $l_k$ in $(G - G_N^{(k)})$ which is
incident into a node in $G_N^{(k)}$ is chosen in STEP 1. This link, along with edges in
$G_T \cup G_N^{(k)}$, forms a loop which we denote as loop($k$). $G_N$ is then updated at the end of
the iteration.
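
A minimal Python sketch of Algorithm FFL follows, applied to the graph of Figure 2.1.
Several details are assumptions of the sketch rather than part of the algorithm as stated:
the function names `find_fundamental_loops` and `shortest_path`, the tie-break in STEP 1
(the lowest-numbered eligible link is taken), and the use of breadth-first search in STEP 2
to obtain an elementary path. The edge list is read off the incidence matrix $A$ of
Example 2.1, with edges 1 through 5 taken as the tree branches and node 1 as the root.

```python
from collections import deque


def find_fundamental_loops(num_nodes, edges, branches, root):
    """Sketch of Algorithm FFL.

    edges    -- dict mapping edge number -> (tail, head)
    branches -- edge numbers of a directed spanning tree rooted at root
    Returns a list of loops; each loop is a list of edge numbers that ends
    with the link l_k chosen in that iteration.
    """
    links = sorted(set(edges) - set(branches))
    gn_nodes, gn_edges = {root}, set()  # G_N starts as the root node only
    loops = []
    for _ in range(len(edges) - num_nodes + 1):
        # STEP 1: a link outside G_N that is incident into a node of G_N.
        # Tie-break (assumption of this sketch): lowest edge number first.
        link = next(e for e in links
                    if e not in gn_edges and edges[e][1] in gn_nodes)
        tail, head = edges[link]
        # STEP 2: close the loop with an elementary path head ~> tail that
        # uses only tree branches and edges already in G_N.
        allowed = set(branches) | gn_edges
        loop = shortest_path(edges, allowed, head, tail) + [link]
        # STEP 3: add the loop's edges and their end nodes to G_N.
        gn_edges.update(loop)
        for e in loop:
            gn_nodes.update(edges[e])
        loops.append(loop)
    return loops


def shortest_path(edges, allowed, src, dst):
    """Breadth-first search; edge numbers of a shortest src ~> dst path."""
    parent = {src: None}  # node -> (previous node, edge used to reach it)
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for e in sorted(allowed):
            t, h = edges[e]
            if t == u and h not in parent:
                parent[h] = (u, e)
                queue.append(h)
    path = []
    node = dst
    while parent[node] is not None:
        node, e = parent[node]
        path.append(e)
    return path[::-1]


# Figure 2.1 graph: edge number -> (tail, head), read off matrix A.
EDGES = {1: (1, 3), 2: (3, 2), 3: (2, 5), 4: (3, 6), 5: (5, 4),
         6: (2, 1), 7: (5, 3), 8: (4, 2), 9: (6, 5)}
loops = find_fundamental_loops(6, EDGES, branches={1, 2, 3, 4, 5}, root=1)

# Row k of B has a 1 in column j when edge j belongs to loop(k).
B = [[1 if e in loop else 0 for e in sorted(EDGES)] for loop in loops]
```

With these choices the rows of `B` coincide with the loop matrix given in Example 2.1. The
sketch assumes a strongly connected graph and a valid spanning tree; on other inputs the
search in STEP 1 or STEP 2 can fail with an exception.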

To prove that Algorithm FFL works, we need to show that link $l_k$ in STEP 1 exists
for each iteration $1 \le k \le (|E| - |V| + 1)$, and we need to show that loop($k$) in STEP 2
exists for $1 \le k \le (|E| - |V| + 1)$.

The following three lemmas are used to prove that link $l_k$ exists in STEP 1 of
Algorithm FFL.

Lemma 2.1 $G_N^{(k)}$ is strongly connected (SC).

Proof: By induction. $G_N^{(1)} = v_R$ is SC. Assume that $G_N^{(k)}$ is SC. Each vertex in
$(G_N^{(k+1)} - G_N^{(k)})$ is part of loop($k$), which has at least one vertex in $G_N^{(k)}$, so
$G_N^{(k+1)}$ is also SC. □

Lemma 2.2 For every node $v$ in $G_N^{(k)}$ except $v_R$, there is a branch of $G_T$ in $G_N^{(k)}$ which
is incident into the node $v$.

Proof: By induction. This holds for $G_N^{(1)}$. Assume this holds for $G_N^{(k)}$. All edges of
loop($k$) are in $G_T \cup G_N^{(k)} \cup l_k$. Since $G_N^{(k+1)} = G_N^{(k)} \cup$ loop($k$), all edges in
$(G_N^{(k+1)} - G_N^{(k)})$ except $l_k$ are tree branches. Since $l_k$ is incident into a node in $G_N^{(k)}$,
each node in $G_N^{(k+1)}$ but not in $G_N^{(k)}$ must have a tree branch in $G_N^{(k+1)}$ incident into it.
So every node in $G_N^{(k+1)}$, except $v_R$, has a tree branch in $G_N^{(k+1)}$ incident into it. □

The  following  lemma  uses  the  result  of  Lemma  2.2. 

Lemma 2.3 There are no branches of $G_T$ in $(G - G_N^{(k)})$ which are incident into a node
in $G_N^{(k)}$.

Proof: By contradiction. Assume a branch exists in $(G - G_N^{(k)})$ which is incident into
the node $v$ in $G_N^{(k)}$. Then $v$ must have two branches incident into it, because we know from
Lemma 2.2 that there is also a branch in $G_N^{(k)}$ which is incident into $v$. However, no
node can have two branches incident into it because multiple paths $v_R \leadsto v$ would exist
in $G_T$, which is not allowed. □

Lemma 2.1 and Lemma 2.3 are used to prove that $l_k$ exists in STEP 1 of Algorithm
FFL.

Theorem 2.4 Link $l_k$ in STEP 1 of Algorithm FFL exists for all iterations
$1 \le k \le (|E| - |V| + 1)$.

Proof: $(G - G_N^{(k)})$ contains exactly $|E| - |V| + 2 - k$ links at the start of iteration $k$, so
$(G - G_N^{(k)})$ contains at least one link during each iteration. Consider the following two
cases:

1.  There exists a node $v \in V$ which is not in $G_N^{(k)}$, i.e., no edges in $G_N^{(k)}$ are incident
into or from $v$. Since $G$ is SC, there is a path from $v$ to $v_R$, implying that there is a
path from $v$ to $G_N^{(k)}$. According to Lemma 2.3, there are no branches in $(G - G_N^{(k)})$
which are incident into a node in $G_N^{(k)}$, so there must be a link in $(G - G_N^{(k)})$ which
is incident into $G_N^{(k)}$, allowing a path to exist from $v$ to $G_N^{(k)}$.

2.  $G_N^{(k)}$ contains all nodes. Each link in $(G - G_N^{(k)})$ is incident into $G_N^{(k)}$ in this case.

□

The following theorem uses Lemma 2.1 to show that loop($k$) in STEP 2 exists for
$1 \le k \le (|E| - |V| + 1)$.

Theorem 2.5 There is a loop containing $l_k$ in $G_T \cup G_N^{(k)} \cup l_k$.

Proof: Consider Figure 2.2. Nodes $v_R$ and $v_{IN}$ are in $G_N^{(k)}$. Link $l_k$ is in $(G - G_N^{(k)})$.
Path $p_2$ exists in $G_N^{(k)}$ because $G_N^{(k)}$ is SC (according to Lemma 2.1). Path $p_1$ exists
in $G_T$ because $v_R$ is the root of the directed spanning tree. So a directed cycle
$v_X \xrightarrow{l_k} v_{IN} \overset{p_2}{\leadsto} v_R \overset{p_1}{\leadsto} v_X$ exists in $G_T \cup G_N^{(k)} \cup l_k$. If this directed cycle is
not elementary, then it must have the form
$v_X \xrightarrow{l_k} v_{IN} \leadsto v_{common} \leadsto v_R \leadsto v_{common} \leadsto v_X$, from which the
elementary directed cycle (loop) $v_X \xrightarrow{l_k} v_{IN} \leadsto v_{common} \leadsto v_X$ can be found. □


Figure 2.2: A directed cycle created by adding link $l_k$ which goes from $(G - G_N^{(k)})$ to
$G_N^{(k)}$.

We construct the fundamental loop matrix $B$ by letting loop($k$) from Algorithm FFL
be the $k$-th row of $B$. The edges in the graph are numbered such that the first $(|V| - 1)$
columns of $B$ correspond to the branches of the spanning tree of $G$, and the remaining
$(|E| - |V| + 1)$ columns correspond to the links. The link $l_k$ is assigned to the
$(|V| - 1 + k)$-th column of $B$. By constructing the fundamental loop matrix in this manner,
it has the form

$$ B = [\, C \mid L \,], \tag{2.3} $$

where $C$ is an $(|E| - |V| + 1) \times (|V| - 1)$ matrix and $L$ is an
$(|E| - |V| + 1) \times (|E| - |V| + 1)$ lower triangular matrix with ones on the diagonal. Note
that the columns of $L$ correspond to the links of $G$ while the columns of $C$ correspond to
the branches of $G$. Because $L$ is lower triangular with ones on the diagonal, it is
nonsingular, so $B$ has rank $(|E| - |V| + 1)$.

It  can  also  be  shown  that  adding  more  loops  of  G  to  B  (adding  a  loop  would  consist 
of  adding  a  row  to  B)  does  not  increase  its  rank.  Therefore,  the  (|£?|  -  1^1  +  1)  rows  of 
B  form  a  basis  for  the  loops  of  G. 

Example 2.2 This example uses Algorithm FFL to form the fundamental loop matrix for the graph in Figure 2.1. The spanning tree with node 1 as the root node is shown in Figure 2.3(a). At the start of Algorithm FFL, $G_{SC}^{(1)}$ is node 1. During iteration $k = 1$, the only possibility for link $l_1$ is edge 6. The only possibility for loop(1) is $1 \xrightarrow{1} 3 \xrightarrow{2} 2 \xrightarrow{6} 1$. $G_{SC}^{(2)}$ is circled in Figure 2.3(b). During iteration $k = 2$, there are two possibilities for link $l_2$, namely, edges 7 and 8. Choosing edge 7 as $l_2$ results in loop(2) $= 3 \xrightarrow{2} 2 \xrightarrow{3} 5 \xrightarrow{7} 3$. $G_{SC}^{(3)}$ is circled in Figure 2.3(c). During iteration $k = 3$, the two possibilities for link $l_3$ are edges 8 and 9. Choosing edge 8 as $l_3$ results in loop(3) $= 2 \xrightarrow{3} 5 \xrightarrow{5} 4 \xrightarrow{8} 2$. $G_{SC}^{(4)}$ is circled in Figure 2.3(d). During iteration $k = 4$, link $l_4$ is edge 9, and loop(4) is $3 \xrightarrow{4} 6 \xrightarrow{9} 5 \xrightarrow{7} 3$. The fundamental loop matrix is


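$$\mathbf{B} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}$$

Note that $\mathbf{B}$ has the desired form as given in (2.3). Row $k$ corresponds to loop($k$) from Algorithm FFL and column $i$ corresponds to edge $i$ of $G$.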
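
The structure of this matrix can be checked mechanically. The following is a minimal sketch, not part of Algorithm FFL itself: it tests that the link block of $\mathbf{B}$ is lower triangular with ones on the diagonal and that $\mathbf{B}\mathbf{A}^T = \mathbf{0}$. The edge endpoints in `EDGES` are an assumption read off the four loops listed in Example 2.2 (Figure 2.1 itself is not reproduced here), and the helper names are hypothetical. The sign convention (+1 at the source of an edge, -1 at its sink) is the one implied by the folding equation used later in this chapter.

```python
# Edge i (1-based) runs from EDGES[i-1][0] to EDGES[i-1][1].
# ASSUMPTION: endpoints are inferred from the loops of Example 2.2.
EDGES = [(1, 3), (3, 2), (2, 5), (3, 6), (5, 4), (2, 1), (5, 3), (4, 2), (6, 5)]
NUM_NODES = 6

# Fundamental loop matrix of Example 2.2 (rows = loops, columns = edges).
B = [
    [1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 1],
]


def incidence_matrix(edges, num_nodes):
    """Build the |V| x |E| matrix with +1 at the source and -1 at the sink."""
    A = [[0] * len(edges) for _ in range(num_nodes)]
    for j, (u, v) in enumerate(edges):
        A[u - 1][j] = 1
        A[v - 1][j] = -1
    return A


def times_transpose(B, A):
    """Return the product B * A^T as a list of rows."""
    return [[sum(b * a for b, a in zip(b_row, a_row)) for a_row in A]
            for b_row in B]


def link_block_is_unit_lower_triangular(B, num_nodes):
    """Check that L in B = [C | L] is lower triangular with a unit diagonal."""
    first_link = num_nodes - 1  # the first |V| - 1 columns are branches
    for i, row in enumerate(B):
        l_row = row[first_link:]
        if l_row[i] != 1 or any(l_row[i + 1:]):
            return False
    return True


A = incidence_matrix(EDGES, NUM_NODES)
BAT = times_transpose(B, A)
```

Every entry of `BAT` being zero reflects the fact that each row of $\mathbf{B}$ traces a closed directed cycle, so each node on the cycle is entered once and left once.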

Figure 2.3: The four steps of Algorithm FFL, which finds the four fundamental loops of the graph shown in Figure 2.1. For each iteration $k$, the subgraph $G_{SC}^{(k)}$ is circled.


2.3  Scheduling  and  Retiming  Formulations 


Time scheduling (or simply scheduling) consists of assigning execution times to the operations in a DFG such that the precedence constraints of the DFG are not violated. This section considers two scheduling problems, namely, scheduling to a time-multiplexed bit-parallel target architecture (we call this bit-parallel scheduling) and scheduling to a bit-serial target architecture (we call this bit-serial scheduling). It turns out that the bit-parallel and bit-serial scheduling formulations are quite similar, and the retiming formulation is a special case of bit-parallel scheduling.

2.3.1 Bit-Parallel Scheduling

In bit-parallel scheduling, a DFG is statically scheduled to a bit-parallel target architecture. The scheduling formulation presented in this section is based on the folding equation developed in [28]. Folding is the process of executing several algorithm operations on a single hardware module. Scheduling is the process of determining at which time units a given algorithm operation is to be executed in hardware.

Before the scheduling formulation is developed, we need a brief description of retiming. The basic retiming equation for the edge $u \xrightarrow{e} v$ is [27]

$$w_r(e) = w(e) + r(v) - r(u), \qquad (2.4)$$

where $w(e)$ is the number of delays on the edge before retiming, $w_r(e)$ is the number of delays on the edge after retiming, and $r(u)$ and $r(v)$ are the retiming values of nodes $u$ and $v$, respectively.

The notions of an iteration and an iteration period are used in this section. An iteration is defined as the execution of each node in the DFG exactly once. The iteration period is defined as the number of clock cycles used to execute one iteration of the DFG in hardware.

Consider an edge $e$ from node $u$ to node $v$, denoted as $u \xrightarrow{e} v$. The operations (nodes) in the DFG are scheduled to be executed in the folded architecture once every $N$ clock cycles, where $N$ is the iteration period. Let the $l$-th iteration of nodes $u$ and $v$ be executed in hardware at time units $Nl + p(u)$ and $Nl + p(v)$, respectively, where $p(u)$ and $p(v)$ are the time partitions to which the nodes are scheduled to execute such that $0 \le p(u), p(v) \le N - 1$. Let edge $e$ have $w_r(e)$ delays, which means that the result of the $l$-th iteration of node $u$ is used by the $(l + w_r(e))$-th iteration of node $v$. The hardware modules which execute nodes $u$ and $v$ are denoted as $H_u$ and $H_v$, respectively. If $H_u$ is pipelined by $d(u)$ stages, then the result of the $l$-th iteration of node $u$ is available at $Nl + p(u) + d(u)$. This sample is used by the $(l + w_r(e))$-th iteration of node $v$, which is executed by $H_v$ at $N(l + w_r(e)) + p(v)$, so the sample must be stored for

$$f(e) = N(l + w_r(e)) + p(v) - (Nl + p(u) + d(u)) = N w_r(e) - d(u) + p(v) - p(u)$$

clock cycles. Substituting for $w_r(e)$ using (2.4) gives

$$f(e) = N w(e) - d(u) - N(r(u) - r(v)) - (p(u) - p(v)). \qquad (2.5)$$

The edge $u \xrightarrow{e} v$ with $w(e)$ delays in the DFG maps to an edge from $H_u$ to $H_v$ with $f(e)$ delays in the architecture, and the data on this edge are switched into $H_v$ at time units $Nl + p(v)$.
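
The derivation of (2.5) can be traced numerically. The sketch below is illustrative only: the function names and the sample values are assumptions, not taken from the text. It computes the storage time once from (2.5) and once directly from the production and consumption times, so the two can be compared.

```python
def folded_delays(N, w, d_u, r_u, r_v, p_u, p_v):
    """Equation (2.5): folded delays on the edge u -> v."""
    return N * w - d_u - N * (r_u - r_v) - (p_u - p_v)


def folded_delays_from_times(N, l, w_r, d_u, p_u, p_v):
    """Storage time as (time consumed by v) minus (time produced by u)."""
    produced = N * l + p_u + d_u        # result of iteration l of u is ready
    consumed = N * (l + w_r) + p_v      # iteration l + w_r of v starts
    return consumed - produced


# Hypothetical edge: w(e) = 1, r(u) = 0, r(v) = 1, so w_r(e) = 2 by (2.4).
N, w, d_u = 4, 1, 1
r_u, r_v = 0, 1
p_u, p_v = 3, 0
w_r = w + r_v - r_u

from_equation = folded_delays(N, w, d_u, r_u, r_v, p_u, p_v)
from_times = folded_delays_from_times(N, 5, w_r, d_u, p_u, p_v)
```

The iteration index `l` cancels, which is why (2.5) does not depend on it.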

Note that we assume that the hardware module $H_u$ is pipelined by $d(u)$ delays, where $d(u)$ is the computation time of the node $u$ in the DFG. If we define an $|E| \times 1$ vector $\mathbf{d}_u$ whose $i$-th element is the computation time of the source node of edge $i$ (the source node of an edge is the node that the edge is incident from), then the folding equation can be written for all $|E|$ edges of the DFG simultaneously using

$$\mathbf{f} = N\mathbf{w} - \mathbf{d}_u - \mathbf{A}^T(\mathbf{p} + N\mathbf{r}), \qquad (2.6)$$

where $\mathbf{A}$ is the $|V| \times |E|$ incidence matrix for the graph $G$ (see Section 2.2.2), $\mathbf{p}$ is the $|V| \times 1$ time partition vector which assigns node $i$ to the time partition $p_i$ ($0 \le p_i \le N - 1$), $\mathbf{r}$ is the $|V| \times 1$ retiming vector with the retiming values of the nodes in $G$, $\mathbf{w}$ is $|E| \times 1$ and contains the number of delays on each edge of $G$, $\mathbf{f}$ is the $|E| \times 1$ folding vector which contains the number of delays on each edge of the folded architecture, and $\mathbf{d}_u$ is the $|E| \times 1$ delay vector as previously described. This formulation of folding is general because it relies upon the retiming solution $\mathbf{r}$ and the time partition vector $\mathbf{p}$. One way to view this is that the DFG is preprocessed using retiming (hence the $\mathbf{r}$ vector) and then scheduling is performed on the retimed DFG (hence the $\mathbf{p}$ vector). Combining $\mathbf{r}$ and $\mathbf{p}$ using $\mathbf{s} = \mathbf{p} + N\mathbf{r}$ results in the schedule vector $\mathbf{s}$. Using $\mathbf{s}$, the scheduling problem can be written as

$$\mathbf{A}^T\mathbf{s} = N\mathbf{w} - \mathbf{d}_u - \mathbf{f}. \qquad (2.7)$$


The rank of the $|V| \times |E|$ incidence matrix $\mathbf{A}$ is $|V| - 1$. Therefore, the left nullspace of $\mathbf{A}$ is spanned by a single vector $\mathbf{x}$ which satisfies $\mathbf{A}^T\mathbf{x} = \mathbf{0}_{|E| \times 1}$. We can see that $\mathbf{x} = \mathbf{1}_{|V| \times 1}$ because each column of $\mathbf{A}$ contains exactly one entry which is a 1, one entry which is a $-1$, and the remaining entries of the column are zero.

Using $\mathbf{A}^T\mathbf{1} = \mathbf{0}_{|E| \times 1}$ we can write

$$\mathbf{A}^T(\mathbf{s} + k\mathbf{1}) = N\mathbf{w} - \mathbf{d}_u - \mathbf{f},$$

which means that adding the constant $k$ to each element of the schedule vector does not change the number of delays on the edges of the folded architecture.


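The incidence matrix $\mathbf{A}$ can be written as

$$\mathbf{A} = [\, \mathbf{a}_1 \;\; \mathbf{a}_2 \;\; \cdots \;\; \mathbf{a}_{|V|} \,]^T,$$

where $\mathbf{a}_i^T$ is the $i$-th row of $\mathbf{A}$. The reduced incidence matrix consists of any $|V| - 1$ rows of $\mathbf{A}$. Removing row $m$ of $\mathbf{A}$ results in

$$\mathbf{A}_R = [\, \mathbf{a}_1 \;\; \mathbf{a}_2 \;\; \cdots \;\; \mathbf{a}_{m-1} \;\; \mathbf{a}_{m+1} \;\; \cdots \;\; \mathbf{a}_{|V|} \,]^T. \qquad (2.8)$$

The reduced incidence matrix $\mathbf{A}_R$ has dimensions $(|V| - 1) \times |E|$ and rank $|V| - 1$. The reduced scheduling vector is defined as

$$\mathbf{s}_R = [\, s_1 \;\; s_2 \;\; \cdots \;\; s_{m-1} \;\; s_{m+1} \;\; \cdots \;\; s_{|V|} \,]^T, \qquad (2.9)$$

which can be written as $\mathbf{s}_R = \mathbf{p}_R + N\mathbf{r}_R$, where $\mathbf{p}_R$ and $\mathbf{r}_R$ are the time partition vector $\mathbf{p}$ and the retiming vector $\mathbf{r}$ with the $m$-th elements removed. Using $\mathbf{A}_R$ and $\mathbf{s}_R$, we can write

$$\mathbf{A}^T\mathbf{s} = s(m)\,\mathbf{a}_m + \mathbf{A}_R^T\mathbf{s}_R.$$

Substituting this into (2.7) results in

$$\mathbf{A}_R^T\mathbf{s}_R = N\mathbf{w} - \mathbf{d}_u - \mathbf{f} - s(m)\,\mathbf{a}_m. \qquad (2.10)$$

Node $m$ is called the reference node. Since replacing $\mathbf{s}$ by $\mathbf{s}' = \mathbf{s} + k\mathbf{1}$ does not alter the resulting folded architecture, we can choose $k = -s(m)$ so $s'(m) = 0$. After replacing $\mathbf{s}$ with $\mathbf{s}' = \mathbf{s} - s(m)\mathbf{1}$, (2.10) becomes $\mathbf{A}_R^T\mathbf{s}'_R = N\mathbf{w} - \mathbf{d}_u - \mathbf{f}$.

Throughout the remainder of this chapter, we will assume that $\mathbf{s}' = \mathbf{s} - s(m)\mathbf{1}$ so $s'(m) = 0$. In an abuse of notation, we will refer to $\mathbf{s}'$ simply as $\mathbf{s}$ so that (2.7) can be written as

$$\mathbf{A}_R^T\mathbf{s}_R = N\mathbf{w} - \mathbf{d}_u - \mathbf{f}. \qquad (2.11)$$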

Lemma 2.6 The equation (2.11) can be solved for $\mathbf{s}_R$ if and only if $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u) = \mathbf{B}\mathbf{f}$.

Proof: The equation (2.11) has a solution if and only if $N\mathbf{w} - \mathbf{d}_u - \mathbf{f}$ is in the $(|V| - 1)$-dimensional row space of $\mathbf{A}_R$. Equivalently, (2.11) has a solution if and only if $N\mathbf{w} - \mathbf{d}_u - \mathbf{f}$ is perpendicular to the $(|E| - |V| + 1)$-dimensional nullspace of $\mathbf{A}_R$ because the nullspace is the orthogonal complement of the row space in $\mathbb{R}^{|E|}$. Since $\mathbf{B}\mathbf{A}^T = \mathbf{0}$ (see Section 2.2.2), the $|E| - |V| + 1$ rows of the fundamental loop matrix $\mathbf{B}$ form a basis for the nullspace of $\mathbf{A}_R$. Therefore, (2.11) has a solution if and only if $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u - \mathbf{f}) = \mathbf{0}$. □

To understand the meaning of $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u - \mathbf{f}) = \mathbf{0}$, we begin by writing $\mathbf{B}$ as

$$\mathbf{B} = [\, \mathbf{b}_1 \;\; \mathbf{b}_2 \;\; \cdots \;\; \mathbf{b}_{|E|-|V|+1} \,]^T$$

such that $\mathbf{b}_i^T$ is the $i$-th row of $\mathbf{B}$. Using this, $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u - \mathbf{f}) = \mathbf{0}$ implies $\mathbf{b}_i^T\mathbf{f} = \mathbf{b}_i^T(N\mathbf{w} - \mathbf{d}_u)$. Recall that $b_{ij} = 1$ if edge $j$ is in loop $i$ and $b_{ij} = 0$ otherwise. Therefore, $\mathbf{b}_i^T\mathbf{f}$ is the total number of folded delays on loop $i$, and $\mathbf{b}_i^T(N\mathbf{w} - \mathbf{d}_u)$ is a constant that depends on $G$. The equation $\mathbf{b}_i^T\mathbf{f} = \mathbf{b}_i^T(N\mathbf{w} - \mathbf{d}_u)$ states that the number of folded delays on loop $i$ is the same for any legal folding vector $\mathbf{f}$, and $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u - \mathbf{f}) = \mathbf{0}$ implies that this is true for all $|E| - |V| + 1$ independent loops of $G$ represented by the rows of $\mathbf{B}$. Furthermore, the sum of the number of folded delays for all edges and pipelining delays associated with all nodes of a loop is the product of the folding factor, $N$, and the number of loop delay elements, as noted in [28]. It can also be shown that this holds for the dependent loops of $G$, i.e., the number of folded delays on each loop of $G$ that is not represented by a row of $\mathbf{B}$ is the same for any legal folding vector $\mathbf{f}$.

If $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u) = \mathbf{B}\mathbf{f}$ holds, (2.11) has exactly one solution for $\mathbf{s}_R$, which is given by

$$\mathbf{s}_R = (\mathbf{A}_R\mathbf{A}_R^T)^{-1}\mathbf{A}_R(N\mathbf{w} - \mathbf{d}_u - \mathbf{f}). \qquad (2.12)$$

The above discussion can be summarized by saying that the number of folded delays on each loop in $G$ is the same for any valid schedule $\mathbf{s}$.

In addition to the condition $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u) = \mathbf{B}\mathbf{f}$, there is also the practical condition that the number of delays on an edge in the folded architecture must be nonnegative. This condition can be written as $\mathbf{f} \ge \mathbf{0}$. The constraints for a valid schedule are

1. $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u) = \mathbf{B}\mathbf{f}$

2. $\mathbf{f} \ge \mathbf{0}$.
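
As an illustration of how these two constraints and (2.11) fit together, the sketch below checks the loop condition for one candidate folding vector and then recovers the schedule. It does not evaluate the matrix expression (2.12); instead it uses a swapped-in technique, propagating $s(u) - s(v) = N w(e) - d(u) - f(e)$ outward from the reference node, which yields the same unique solution when the loop condition holds. The edge endpoints are an assumption inferred from the loops of Example 2.2, and the vectors `C_VEC` (standing for $N\mathbf{w} - \mathbf{d}_u$ with $N = 4$ and unit computation times) and `f` are illustrative choices.

```python
# ASSUMPTION: edge i (1-based) runs EDGES[i-1][0] -> EDGES[i-1][1].
EDGES = [(1, 3), (3, 2), (2, 5), (3, 6), (5, 4), (2, 1), (5, 3), (4, 2), (6, 5)]

B = [
    [1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 1],
]

C_VEC = [-1, 3, -1, -1, -1, -1, -1, 3, 3]  # N w - d_u, one entry per edge


def is_valid_folding(B, c, f):
    """Constraints 1 and 2: B c = B f and f >= 0."""
    dot = lambda row, x: sum(b * xi for b, xi in zip(row, x))
    return all(dot(row, c) == dot(row, f) for row in B) and min(f) >= 0


def schedule_from_folding(edges, c, f, reference=1):
    """Solve s[u] - s[v] = c[e] - f[e] for every edge, with s[reference] = 0.

    Values are propagated along edges until every node is labelled; this
    terminates because the graph is connected.
    """
    s = {reference: 0}
    changed = True
    while changed:
        changed = False
        for j, (u, v) in enumerate(edges):
            rhs = c[j] - f[j]
            if u in s and v not in s:
                s[v] = s[u] - rhs
                changed = True
            elif v in s and u not in s:
                s[u] = s[v] + rhs
                changed = True
    return s


f = [0, 0, 0, 0, 0, 1, 1, 1, 0]   # one candidate folding vector
valid = is_valid_folding(B, C_VEC, f)
s = schedule_from_folding(EDGES, C_VEC, f)
```

Because the loop condition holds for this `f`, the propagated values are consistent on every edge, including the links that were not needed to label the nodes.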

2.3.2  Retiming 

Retiming is the process of moving delays around in a circuit without changing the functionality of the circuit [27]. A brief description of retiming is given at the beginning of Section 2.3.1. This section describes how retiming can be viewed as a special case of bit-parallel scheduling.

The folding equation for a graph $G$ is given in (2.6). If each node in $G$ represents a hardware operator, then all operations in the graph are executed in a single clock cycle resulting in an iteration period of $N = 1$. The elements of the time partition vector $\mathbf{p}$ are all zero because time partition zero is the only available partition. If we let $\mathbf{d}_u = \mathbf{0}$, i.e., we do not consider any internal pipelining of the operators, (2.6) becomes

$$\mathbf{f} = (1)\mathbf{w} - \mathbf{0} - \mathbf{A}^T(\mathbf{0} + 1\mathbf{r})$$

which simplifies to

$$\mathbf{f} = \mathbf{w} - \mathbf{A}^T\mathbf{r}. \qquad (2.13)$$

Since $\mathbf{f}$ is the number of delays in the folded architecture, $\mathbf{f}$ is equivalent to $\mathbf{w}_r$ for $N = 1$, so (2.13) becomes

$$\mathbf{w}_r = \mathbf{w} - \mathbf{A}^T\mathbf{r}, \qquad (2.14)$$

which is simply the matrix notation for writing (2.4) simultaneously for all edges of the graph. This demonstrates that retiming is simply scheduling when the iteration period is unity.
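
The reduction from (2.6) to (2.14) can be checked on a small case. In the sketch below, which is illustrative only, `fold` evaluates (2.6) edge by edge and `retime` evaluates (2.4). The edge endpoints are an assumption inferred from the loops of Example 2.2, the weights are those implied by Example 2.4, and the retiming values in `r` are arbitrary.

```python
# ASSUMPTION: edge i (1-based) runs EDGES[i-1][0] -> EDGES[i-1][1].
EDGES = [(1, 3), (3, 2), (2, 5), (3, 6), (5, 4), (2, 1), (5, 3), (4, 2), (6, 5)]
W_VEC = [0, 1, 0, 0, 0, 0, 0, 1, 1]          # delays on each edge
NODES = range(1, 7)


def fold(edges, N, w, d_u, p, r):
    """Equation (2.6), one edge at a time: f = N w - d_u - A^T (p + N r)."""
    return [N * w[j] - d_u[j] - ((p[u] + N * r[u]) - (p[v] + N * r[v]))
            for j, (u, v) in enumerate(edges)]


def retime(edges, w, r):
    """Equation (2.4) for every edge: w_r(e) = w(e) + r(v) - r(u)."""
    return [w[j] + r[v] - r[u] for j, (u, v) in enumerate(edges)]


r = {1: 0, 2: -1, 3: 0, 4: 0, 5: -1, 6: 0}   # arbitrary retiming values
zero_p = {node: 0 for node in NODES}         # only partition 0 exists when N = 1
zero_d = [0] * len(EDGES)                    # no internal pipelining

f_unit_period = fold(EDGES, 1, W_VEC, zero_d, zero_p, r)
w_retimed = retime(EDGES, W_VEC, r)
```

With $N = 1$, $\mathbf{p} = \mathbf{0}$, and $\mathbf{d}_u = \mathbf{0}$ the two lists coincide, which is the content of (2.13) and (2.14).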

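Using $\mathbf{A}^T\mathbf{1}_{|V| \times 1} = \mathbf{0}_{|E| \times 1}$, (2.14) can be written as

$$\mathbf{A}^T(\mathbf{r} + k\mathbf{1}) = \mathbf{w} - \mathbf{w}_r.$$

If $\mathbf{r}$ is a retiming vector which maps the graph $G$ to the retimed graph $G_r$, then so is $(\mathbf{r} + k\mathbf{1})$ for any integer $k$.

In the context of retiming (i.e., assuming $N = 1$, $\mathbf{p} = \mathbf{0}$, $\mathbf{d}_u = \mathbf{0}$, and $\mathbf{f} = \mathbf{w}_r$), (2.11) can be written as

$$\mathbf{A}_R^T\mathbf{r}_R = \mathbf{w} - \mathbf{w}_r. \qquad (2.15)$$

Recall that (2.11) assumes that $s(m) = 0$. Since $\mathbf{s} = N\mathbf{r} + \mathbf{p}$ and $\mathbf{p} = \mathbf{0}$ is assumed to obtain (2.15), this implies that $r(m) = 0$ in (2.15). In other words, the retiming value of the reference node is 0 in this formulation.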

The translation of Lemma 2.6 to the retiming context is that (2.15) has a solution if and only if $\mathbf{B}\mathbf{w} = \mathbf{B}\mathbf{w}_r$ holds. This implies that the number of delays on any loop in $G$ remains unchanged during retiming, as noted in [27]. If $\mathbf{B}\mathbf{w} = \mathbf{B}\mathbf{w}_r$ holds, (2.15) has exactly one solution for $\mathbf{r}_R$, which is given by

$$\mathbf{r}_R = (\mathbf{A}_R\mathbf{A}_R^T)^{-1}\mathbf{A}_R(\mathbf{w} - \mathbf{w}_r). \qquad (2.16)$$

In addition to the condition $\mathbf{B}\mathbf{w} = \mathbf{B}\mathbf{w}_r$, there is also the practical condition that the number of delays on an edge in the retimed graph must be nonnegative. This condition can be written as $\mathbf{w}_r \ge \mathbf{0}$. The conditions for a valid retiming from $G$ to $G_r$ are

1. $\mathbf{B}\mathbf{w} = \mathbf{B}\mathbf{w}_r$

2. $\mathbf{w}_r \ge \mathbf{0}$.

2.3.3  Bit-Serial  Scheduling 

In this section, a scheduling formulation is developed where the target architecture is a bit-serial architecture. This formulation, which is similar to the formulation in Chapter 6 of [50], has the same general form as the bit-parallel scheduling and retiming formulations in Sections 2.3.1 and 2.3.2.

A bit-serial operator is often represented using a timing diagram such as the one in Figure 2.4. Let the execution of operator $A$ in this figure begin at time $T_A$. The first bit of each of the inputs $x_1$, $x_2$, and $x_3$ arrives at time units $T_A + t(x_1)$, $T_A + t(x_2)$, and $T_A + t(x_3)$, respectively. The first bit of each of the outputs $y_1$ and $y_2$ is produced at time units $T_A + t(y_1)$ and $T_A + t(y_2)$, respectively. In other words, the timing diagram gives the relative differences between the timing of the input and output samples of the operator.

Figure 2.4: The timing diagram for the bit-serial operator $A$.

Example  2.3  For  the  bit-serial  adder  in  Figure  2.5(a)  which  computes  F  =  A  +  B,  the 
timing  diagram  is  shown  in  Figure  2.5(b).  Note  that  W  is  the  wordlength. 


Figure 2.5: (a) The architecture for a bit-serial adder for a wordlength of $W$. (b) The timing diagram for this architecture.

The constraints for the bit-serial scheduling problem can be derived using the timing diagram. Consider the edge $u \xrightarrow{e} v$ with $w_r(e)$ delays in Figure 2.6. The output of iteration $l$ of $u$ is used as the input of iteration $l + w_r(e)$ of $v$. Let the $l$-th iteration of nodes $u$ and $v$ begin execution at time units $Wl + p(u)$ and $Wl + p(v)$, respectively, where $W$ is the data wordlength and $p(u)$ and $p(v)$ are the time partitions to which the nodes are scheduled to execute such that $0 \le p(u), p(v) \le W - 1$. The output of the $l$-th iteration of $u$ is available at $Wl + p(u) + t(u)$ and is consumed as an input of the $(l + w_r(e))$-th iteration of $v$ at $W(l + w_r(e)) + p(v) + t(v)$, so the result must be stored for

$$b(e) = W(l + w_r(e)) + p(v) + t(v) - [Wl + p(u) + t(u)] = W w_r(e) - (t(u) - t(v)) + p(v) - p(u)$$

clock cycles.

Figure 2.6: An edge $u \xrightarrow{e} v$ with $w_r(e)$ delays.

This equation can be written for all $|E|$ edges of the graph simultaneously according to

$$\mathbf{b} = W\mathbf{w}_r - (\mathbf{t}_u - \mathbf{t}_v) - \mathbf{A}^T\mathbf{p}, \qquad (2.17)$$

where

• $\mathbf{A}$ is the incidence matrix for the graph.

• $\mathbf{p}$ is the time partition vector which assigns node $i$ to the time partition $p_i$ where $0 \le p_i \le W - 1$.

• $\mathbf{t}_u$ is defined such that $t_{u_i}$ is the value $t(\cdot)$ at the source of edge $i$ in the graph.

• $\mathbf{t}_v$ is defined such that $t_{v_i}$ is the value $t(\cdot)$ at the sink of edge $i$ in the graph.

• $\mathbf{w}_r$ contains the number of delays on each edge of the retimed DFG.

• $\mathbf{b}$ contains the number of serial delays on each edge of the hardware implementation.

The bit-serial folding equation (2.17) operates on the retimed DFG $G_r$. Substituting (2.14) into (2.17) results in

$$\mathbf{b} = W\mathbf{w} - (\mathbf{t}_u - \mathbf{t}_v) - \mathbf{A}^T(\mathbf{p} + W\mathbf{r}).$$

Combining $\mathbf{r}$ and $\mathbf{p}$ using $\mathbf{s} = \mathbf{p} + W\mathbf{r}$ results in

$$\mathbf{A}^T\mathbf{s} = W\mathbf{w} - (\mathbf{t}_u - \mathbf{t}_v) - \mathbf{b}.$$

This equation can be rewritten as

$$\mathbf{A}_R^T\mathbf{s}_R = W\mathbf{w} - (\mathbf{t}_u - \mathbf{t}_v) - \mathbf{b}, \qquad (2.18)$$

where $\mathbf{A}_R$ and $\mathbf{s}_R$ are defined as in (2.8) and (2.9), and the scheduling value for the reference node is $s(m) = 0$.

Using the same argument as in Lemma 2.6, it can be shown that the bit-serial scheduling equation (2.18) has a solution if and only if $\mathbf{B}(W\mathbf{w} - (\mathbf{t}_u - \mathbf{t}_v)) = \mathbf{B}\mathbf{b}$. The equation $\mathbf{B}(W\mathbf{w} - (\mathbf{t}_u - \mathbf{t}_v)) = \mathbf{B}\mathbf{b}$ states that the sum of the serial delays in any loop of the hardware implementation is the same for any valid serial delay vector $\mathbf{b}$. In addition, the sum of the number of serial delay elements of all edges and latencies associated with all nodes in a loop is the same as the product of the word-length and the number of loop delay elements.

A second constraint, $\mathbf{b} \ge \mathbf{0}$, exists because a connection in hardware cannot have a negative number of delays. The constraints for a valid bit-serial schedule are

1. $\mathbf{B}(W\mathbf{w} - (\mathbf{t}_u - \mathbf{t}_v)) = \mathbf{B}\mathbf{b}$

2. $\mathbf{b} \ge \mathbf{0}$.

The value of the schedule vector $\mathbf{s}$ can be found using

$$\mathbf{s}_R = (\mathbf{A}_R\mathbf{A}_R^T)^{-1}\mathbf{A}_R(W\mathbf{w} - (\mathbf{t}_u - \mathbf{t}_v) - \mathbf{b}). \qquad (2.19)$$

2.4  Generating  All  Scheduling  and  Retiming  Solutions 

2.4.1  Generating  All  Bit-Parallel  Scheduling  Solutions 

Based on the two constraints $\mathbf{B}(N\mathbf{w} - \mathbf{d}_u) = \mathbf{B}\mathbf{f}$ and $\mathbf{f} \ge \mathbf{0}$, all scheduling solutions for a strongly connected DFG can be generated. A systematic technique for generating these solutions is presented in this section.


Recall that $\mathbf{B}$ is the fundamental loop matrix which can be expressed as $\mathbf{B} = [\, \mathbf{C} \mid \mathbf{L} \,]$, where $\mathbf{C}$ is an $(|E| - |V| + 1) \times (|V| - 1)$ matrix and $\mathbf{L}$ is an $(|E| - |V| + 1) \times (|E| - |V| + 1)$ lower triangular matrix with ones on the diagonal. The columns of $\mathbf{C}$ correspond to the branches of the spanning tree of $G$ which is chosen before Algorithm FFL is used to find $\mathbf{B}$, and the columns of $\mathbf{L}$ correspond to the links of $G$. The rows of $\mathbf{B}$ correspond to $(|E| - |V| + 1)$ linearly independent loops in $G$.


The  algorithm  for  generating  all  scheduling  solutions  requires  an  interval  to  be  written 
for  the  folded  weight  of  each  branch  of  G  and  an  equality  to  be  written  for  the  folded 
weight  of  each  link  of  G.  The  interval  for  the  folded  weight  of  a  branch  gives  the  range  of 
possible  values  for  the  number  of  folded  delays  for  this  branch  in  the  folded  architecture. 
The  equality  for  the  folded  weight  of  a  link  gives  an  expression  for  the  number  of  delays 
for  the  link  in  the  folded  architecture.  Using  these  intervals  and  equalities,  code  can  be 
constructed  to  generate  all  possible  scheduling  solutions. 


To determine these intervals and equalities, the elements of the fundamental loop matrix are examined one-by-one in a row-by-row manner, starting at the top-left of the matrix. Each time a "1" is encountered in the $\mathbf{C}$ submatrix of $\mathbf{B}$ such that this "1" is the first "1" encountered in its column, an interval is specified for this branch. This interval, which represents the range for the number of folded delays for the branch in the folded architecture, takes into account the intervals and equalities previously determined in the row-by-row scan of $\mathbf{B}$.


Assume that the first "1" in column $n$ of $\mathbf{C}$ is in row $m$, i.e., $b_{mn} = 1$ and $b_{in} = 0$ for all $i < m$. Let $\mathbf{b}_k^T$ denote any row of $\mathbf{B}$ such that $b_{kn} = 1$, i.e., loop($k$) is a fundamental loop that contains the edge $n$. Since $b_{mn}$ is the first "1" in column $n$, $m \le k \le |E| - |V| + 1$ must hold, i.e., $b_{kn}$ is in row $m$ or in a row which is below row $m$. From $\mathbf{B}\mathbf{f} = \mathbf{B}(N\mathbf{w} - \mathbf{d}_u)$, we get

$$\mathbf{b}_k^T\mathbf{f} = \mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u) \;\Rightarrow\; \sum_{j \in E} b_{kj} f_j = \mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u) \;\Rightarrow\; f_n + \sum_{j \in E - \{n\}} b_{kj} f_j = \mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u). \qquad (2.20)$$

Let $D$ denote the set of edges encountered before reaching the element $b_{mn}$ in the row-by-row scan of $\mathbf{B}$. Mathematically, $D$ is the set of edges $j$ such that there exists an element $b_{ij} = 1$ such that $j + (|E| - 1)i < n + (|E| - 1)m$. Using $D$, we can rewrite (2.20) as

$$f_n + \sum_{j \in D} b_{kj} f_j + \sum_{j \in E - D - \{n\}} b_{kj} f_j = \mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u). \qquad (2.21)$$

The intervals and equalities for the edges in the set $E - D - \{n\}$ have not yet been determined; however, we do know from $\mathbf{f} \ge \mathbf{0}$ that $\sum_{j \in E - D - \{n\}} b_{kj} f_j \ge 0$. Using this in (2.21) results in

$$f_n + \sum_{j \in D} b_{kj} f_j \le \mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u).$$

Using this along with $\mathbf{f} \ge \mathbf{0}$ specifies the interval for $f_n$

$$0 \le f_n \le \mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u) - \sum_{j \in D} b_{kj} f_j, \qquad (2.22)$$

which must hold for all $k$ such that $b_{kn} = 1$.

Because the matrix $\mathbf{L}$ in $\mathbf{B} = [\, \mathbf{C} \mid \mathbf{L} \,]$ is lower triangular with ones on the diagonal, the diagonal element of row $m$, $l_{mm}$, is always the first "1" encountered in column $m$ of $\mathbf{L}$ during the row-by-row scan of $\mathbf{B}$. In addition to using $l_{mm}$ to denote this element, it can also be denoted as $b_{mn}$ where $n = |V| - 1 + m$. When $b_{mn}$ is encountered in the row-by-row scan of $\mathbf{B}$ such that $n = |V| - 1 + m$, an equality is written for $f_n$ based on the equation $\mathbf{b}_m^T\mathbf{f} = \mathbf{b}_m^T(N\mathbf{w} - \mathbf{d}_u)$. This equality, which uses the fact that the intervals and equalities have already been determined for all edges in loop($m$) except edge $n$, is

$$f_n = \mathbf{b}_m^T(N\mathbf{w} - \mathbf{d}_u) - \sum_{j \in D} b_{mj} f_j. \qquad (2.23)$$

To summarize the above discussion, the matrix $\mathbf{B}$ is scanned in a row-by-row manner starting with $b_{11}$. When $b_{mn} = 1$ is encountered, if $b_{mn}$ is the first "1" in its column of $\mathbf{C}$, the interval in (2.22) is written for all $k$ such that $b_{kn} = 1$. When $b_{mn} = 1$ is encountered where $n = |V| - 1 + m$, the equality in (2.23) is written.

The intervals for the $|V| - 1$ branches of $G$ are denoted as $I_j$ for $1 \le j \le |V| - 1$. An algorithm for writing these $|V| - 1$ intervals for the branches and the $|E| - |V| + 1$ equalities for the links is given below. At any point in this algorithm, $D$ is the set of edges in $G$ whose intervals or equalities have previously been determined.

Algorithm IE (Intervals and Equalities)

D = {};
FOR (m = 1 TO |E| - |V| + 1)
{
    FOR (n = 1 TO |E|)
    {
        IF (b_mn = 1 AND b_kn = 0 ∀k < m)
        {
            IF (1 ≤ n ≤ |V| - 1)
            {
                I_n = [0, min {σ(m,n), σ(m+1,n), ..., σ(|E| - |V| + 1, n)}];
                D ← D + {n};
            }
            ELSE
            {
                f_n = b_m^T (N w - d_u) - Σ_{j∈D} b_mj f_j;
                D ← D + {n};
            }
        }
    }
}

where

$$\sigma(k,n) = \begin{cases} \mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u) - \sum_{j \in D} b_{kj} f_j & \text{if } b_{kn} = 1 \\ \infty & \text{otherwise.} \end{cases}$$
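
The row-by-row scan can be sketched in a few lines. The version below is an illustrative transcription, not the author's implementation: the function name and the symbolic output format are assumptions. Each interval is reported as a list of pairs (row $k$, edges $j \in D$ with $b_{kj} = 1$), meaning the upper bound is the minimum over those rows of $\mathbf{b}_k^T(N\mathbf{w} - \mathbf{d}_u) - \sum f_j$; each equality is reported with the edges subtracted in (2.23).

```python
# Fundamental loop matrix of Example 2.2 (rows = loops, columns = edges).
B = [
    [1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 1],
]


def algorithm_ie(B, num_nodes):
    """Return the intervals and equalities in the order they are generated.

    Edge and row numbers in the result are 1-based, as in the text.
    """
    num_loops, num_edges = len(B), len(B[0])
    D = []       # 0-based edges already given an interval or an equality
    steps = []
    for m in range(num_loops):
        for n in range(num_edges):
            first_one = B[m][n] == 1 and all(B[k][n] == 0 for k in range(m))
            if not first_one:
                continue
            if n < num_nodes - 1:
                # Branch: one upper bound per loop k >= m that contains edge n.
                bounds = [(k + 1, sorted(j + 1 for j in D if B[k][j] == 1))
                          for k in range(m, num_loops) if B[k][n] == 1]
                steps.append(("interval", n + 1, bounds))
            else:
                # Link: fixed by the edges of loop m that are already in D.
                used = sorted(j + 1 for j in D if B[m][j] == 1)
                steps.append(("equality", n + 1, used))
            D.append(n)
    return steps


steps = algorithm_ie(B, num_nodes=6)
order = [edge for _, edge, _ in steps]
```

For the matrix above, `order` lists the edges in the sequence in which their FOR loops and assignment statements would be written.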

From the intervals and equalities, code can be written to enumerate all possible scheduling solutions. The general structure of the code is:

1. Write FOR loops for the intervals and write assignment statements for the equalities in the same order that these intervals and equalities are generated in Algorithm IE.

2. Test the link weights for non-negativity. If the link weights pass this test, the edge weights represent a valid scheduling solution.

This technique generates all possible scheduling solutions because the FOR loop for branch $m$ assigns $f_m$ every integer value which is legal under the constraints $\mathbf{B}\mathbf{f} = \mathbf{B}(N\mathbf{w} - \mathbf{d}_u)$ and $\mathbf{f} \ge \mathbf{0}$, while taking into consideration the values of $f_i$ which are already contained in a FOR loop or an assignment statement.

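Example 2.4 In this example, we find all scheduling solutions for the DFG in Figure 2.7 assuming an iteration period of 4 and assuming that the computation time for each node is unity. For this DFG,

$$N\mathbf{w} - \mathbf{d}_u = [\, -1 \;\; 3 \;\; -1 \;\; -1 \;\; -1 \;\; -1 \;\; -1 \;\; 3 \;\; 3 \,]^T$$

and

$$\mathbf{B}\mathbf{f} = \mathbf{B}(N\mathbf{w} - \mathbf{d}_u) \;\Rightarrow\; \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} f(1) \\ f(2) \\ f(3) \\ f(4) \\ f(5) \\ f(6) \\ f(7) \\ f(8) \\ f(9) \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}.$$

Figure 2.7: The data-flow graph used in Examples 2.4 and 2.5.

Using Algorithm IE gives the intervals and equalities

$$\begin{array}{ll}
I_1 = [0, 1] & D = \{1\} \\
I_2 = [0, 1 - f_1] & D = \{1,2\} \\
f_6 = 1 - f_1 - f_2 & D = \{1,2,6\} \\
I_3 = [0, 1 - f_2] & D = \{1,2,3,6\} \\
f_7 = 1 - f_2 - f_3 & D = \{1,2,3,6,7\} \\
I_5 = [0, 1 - f_3] & D = \{1,2,3,5,6,7\} \\
f_8 = 1 - f_3 - f_5 & D = \{1,2,3,5,6,7,8\} \\
I_4 = [0, 1 - f_7] & D = \{1,2,3,4,5,6,7,8\} \\
f_9 = 1 - f_4 - f_7 & D = E.
\end{array}$$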


The code for finding all scheduling solutions is

for (f1 = 0; f1 <= 1; f1++)
  for (f2 = 0; f2 <= 1 - f1; f2++)
  {
    f6 = 1 - f1 - f2;
    for (f3 = 0; f3 <= 1 - f2; f3++)
    {
      f7 = 1 - f2 - f3;
      for (f5 = 0; f5 <= 1 - f3; f5++)
      {
        f8 = 1 - f3 - f5;
        for (f4 = 0; f4 <= 1 - f7; f4++)
        {
          f9 = 1 - f4 - f7;
          if (f6 >= 0 AND f7 >= 0 AND f8 >= 0 AND f9 >= 0)
            print the values of f1 through f9 and s1 through s6
        }
      }
    }
  }

There are twelve scheduling solutions for this DFG. The scheduling vector $\mathbf{s}_R$ can be computed from the folded edge vector $\mathbf{f}$ using (2.12). Using node 1 as the reference node, the folded edge weights and the scheduling values for the nodes are listed in Table 2.1.

Table 2.1: The twelve valid scheduling solutions for the DFG in Figure 2.7.
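
The count of twelve can be cross-checked independently of the interval construction. The sketch below ports the nested FOR loops to Python and compares the result with a brute-force search over all 0/1 vectors satisfying $\mathbf{B}\mathbf{f} = [1\;1\;1\;1]^T$. Restricting the search to 0/1 entries is safe here because every edge lies on at least one loop whose folded delays sum to 1 and $\mathbf{f} \ge \mathbf{0}$. The function names are hypothetical.

```python
from itertools import product

B = [
    [1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 1],
]
LOOP_SUMS = [1, 1, 1, 1]   # B (N w - d_u) for N = 4 and unit computation times


def nested_loop_solutions():
    """Direct port of the nested FOR loops of Example 2.4."""
    found = []
    for f1 in range(0, 2):
        for f2 in range(0, 1 - f1 + 1):
            f6 = 1 - f1 - f2
            for f3 in range(0, 1 - f2 + 1):
                f7 = 1 - f2 - f3
                for f5 in range(0, 1 - f3 + 1):
                    f8 = 1 - f3 - f5
                    for f4 in range(0, 1 - f7 + 1):
                        f9 = 1 - f4 - f7
                        if min(f6, f7, f8, f9) >= 0:
                            found.append((f1, f2, f3, f4, f5, f6, f7, f8, f9))
    return found


def brute_force_solutions():
    """Every 0/1 vector f with B f equal to the required loop sums."""
    return [f for f in product(range(2), repeat=9)
            if all(sum(b * x for b, x in zip(row, f)) == total
                   for row, total in zip(B, LOOP_SUMS))]


from_loops = nested_loop_solutions()
from_search = brute_force_solutions()
```

Both lists contain the same twelve folding vectors, which agrees with the count stated above.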

Once all possible $\mathbf{f}$ vectors have been found and the corresponding $\mathbf{s}$ vectors have been computed using (2.12), the $\mathbf{r}$ and $\mathbf{p}$ vectors can be found from $\mathbf{s}$ (recall that $\mathbf{s} = \mathbf{p} + N\mathbf{r}$) using $\mathbf{r} = \lfloor \mathbf{s}/N \rfloor$ and $\mathbf{p} = \mathbf{s} - N\mathbf{r}$. It can be shown that these expressions for $\mathbf{r}$ and $\mathbf{p}$ result in

• $\mathbf{0} \le \mathbf{p} \le N - 1$. This means that $p_i$ is indeed a time partition satisfying $0 \le p_i \le N - 1$.

• $\mathbf{w}_r \ge \mathbf{0}$ and $\mathbf{B}\mathbf{w} = \mathbf{B}\mathbf{w}_r$. This means that $\mathbf{r}$ is a valid retiming solution of $G$.

To summarize, the following four steps can be used to find all valid schedules for a strongly connected DFG:

1. Find all vectors $\mathbf{f}$ such that $\mathbf{f} \ge \mathbf{0}$ and $\mathbf{B}\mathbf{f} = \mathbf{B}(N\mathbf{w} - \mathbf{d}_u)$.

2. Compute $\mathbf{s}$ using (2.12) and $s(m) = 0$, where $m$ is the reference node.

3. $\mathbf{r} = \lfloor \mathbf{s}/N \rfloor$.

4. $\mathbf{p} = \mathbf{s} - N\mathbf{r}$.

These four steps give the valid schedules for $G$. The retiming vector $\mathbf{r}$ corresponds to a valid retiming solution for $G$, and the elements of the partition vector $\mathbf{p}$ satisfy $0 \le p_i \le N - 1$.
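
Steps 3 and 4 are sketched below for a single schedule. The schedule `s` is an illustrative assumption (it is the schedule obtained for the folding vector $\mathbf{f} = [0\;0\;0\;0\;0\;1\;1\;1\;0]^T$ under edge endpoints inferred from the loops of Example 2.2, not a row copied from Table 2.1). Python's `//` operator rounds toward negative infinity, so it implements the floor even for negative schedule values.

```python
# ASSUMPTION: edge i (1-based) runs EDGES[i-1][0] -> EDGES[i-1][1].
EDGES = [(1, 3), (3, 2), (2, 5), (3, 6), (5, 4), (2, 1), (5, 3), (4, 2), (6, 5)]
W_VEC = [0, 1, 0, 0, 0, 0, 0, 1, 1]   # delays on each edge of the DFG
N = 4                                 # iteration period


def retiming_and_partitions(s, N):
    """Steps 3 and 4: r = floor(s / N) and p = s - N r, node by node."""
    r = {node: value // N for node, value in s.items()}
    p = {node: value - N * r[node] for node, value in s.items()}
    return r, p


def retimed_weights(edges, w, r):
    """Equation (2.4) for every edge: w_r(e) = w(e) + r(v) - r(u)."""
    return [w[j] + r[v] - r[u] for j, (u, v) in enumerate(edges)]


s = {1: 0, 2: -2, 3: 1, 4: 0, 5: -1, 6: 2}   # illustrative schedule, s(1) = 0
r, p = retiming_and_partitions(s, N)
w_r = retimed_weights(EDGES, W_VEC, r)
```

For this schedule every partition lies in the range 0 to $N - 1$ and every retimed edge weight is nonnegative, as the two bullet points above require.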

For each legal folding vector $\mathbf{f}$, the technique in this section finds exactly one schedule $\mathbf{s}$, which contains information about the time partitions $\mathbf{p}$ and the retiming values $\mathbf{r}$ of the nodes. However, there are actually $N$ schedules which map the DFG to a folded architecture which has $\mathbf{f}$ delays on its edges. We call these $N$ solutions equivalent schedules, and we call the solution found using Step 2 above the fundamental schedule $\mathbf{s}$ of the folding vector $\mathbf{f}$. The $N$ equivalent schedules are $\mathbf{s} + k\mathbf{1}$ for $0 \le k \le N - 1$. Replacing $\mathbf{s}$ by $\mathbf{s} + k\mathbf{1}$ has two effects. First, the switching instance $Nl + j$ ($0 \le j \le N - 1$) in the folded architecture becomes $Nl + ((j + k) \bmod N)$. Second, if scheduling is viewed as preprocessing the DFG by retiming (finding $\mathbf{r}$) and then assigning time partitions (finding $\mathbf{p}$), the preprocessed DFG may change because $\mathbf{r}$ may change. A nice property of the technique presented in this section is that it finds the fundamental schedule $\mathbf{s}$ for each folding vector $\mathbf{f}$, and the $N$ equivalent schedules are implicitly known to be $\mathbf{s} + k\mathbf{1}$ for $0 \le k \le N - 1$.
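
The two effects of replacing $\mathbf{s}$ by $\mathbf{s} + k\mathbf{1}$ can be observed on a concrete case. In the sketch below, the fundamental schedule and the edge endpoints are illustrative assumptions (the endpoints are inferred from the loops of Example 2.2), and `C_VEC` stands for $N\mathbf{w} - \mathbf{d}_u$ from Example 2.4. The folding vector is recomputed from (2.7) for each of the $N$ equivalent schedules.

```python
# ASSUMPTION: edge i (1-based) runs EDGES[i-1][0] -> EDGES[i-1][1].
EDGES = [(1, 3), (3, 2), (2, 5), (3, 6), (5, 4), (2, 1), (5, 3), (4, 2), (6, 5)]
C_VEC = [-1, 3, -1, -1, -1, -1, -1, 3, 3]   # N w - d_u, one entry per edge
N = 4


def split_schedule(s, N):
    """Return (r, p) with r = floor(s / N) and p = s - N r."""
    r = {node: value // N for node, value in s.items()}
    p = {node: value - N * r[node] for node, value in s.items()}
    return r, p


def folding_vector(edges, c, s):
    """Rearranged (2.7): f = (N w - d_u) - A^T s, one edge at a time."""
    return [c[j] - (s[u] - s[v]) for j, (u, v) in enumerate(edges)]


def equivalent_schedules(s, N):
    """The N schedules s + k 1 for k = 0, ..., N - 1."""
    return [{node: value + k for node, value in s.items()} for k in range(N)]


fundamental = {1: 0, 2: -2, 3: 1, 4: 0, 5: -1, 6: 2}   # illustrative, s(1) = 0
schedules = equivalent_schedules(fundamental, N)
folding_vectors = [folding_vector(EDGES, C_VEC, s) for s in schedules]
retimings = [split_schedule(s, N)[0] for s in schedules]
partitions = [split_schedule(s, N)[1] for s in schedules]
```

All entries of `folding_vectors` are identical, each partition is rotated by $k$ modulo $N$, and the retiming values differ between some of the schedules, which is the sense in which the preprocessed DFG may change.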

2.4.2  Generating  All  Retiming  Solutions 

Since retiming is a special case of scheduling, the techniques in Section 2.4.1 for generating all scheduling solutions can also be used to generate all retiming solutions by replacing $\mathbf{f}$ with $\mathbf{w}_r$ and letting $N = 1$ and $\mathbf{d}_u = \mathbf{0}$.


Example 2.5 In this example, we generate the edge intervals and equalities for the graph in Figure 2.7. The fundamental loop matrix for this graph is given in (2.1), the weight vector is

$$\mathbf{w} = [\, 0 \;\; 1 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 1 \;\; 1 \,]^T,$$

and $\mathbf{B}\mathbf{w} = [\, 1 \;\; 1 \;\; 1 \;\; 1 \,]^T$. The intervals and equalities are generated in the following order using Algorithm IE.

$$\begin{array}{ll}
I_1 = [0, 1] & D = \{1\} \\
I_2 = [0, 1 - w_{r_1}] & D = \{1,2\} \\
w_{r_6} = 1 - w_{r_1} - w_{r_2} & D = \{1,2,6\} \\
I_3 = [0, 1 - w_{r_2}] & D = \{1,2,3,6\} \\
w_{r_7} = 1 - w_{r_2} - w_{r_3} & D = \{1,2,3,6,7\} \\
I_5 = [0, 1 - w_{r_3}] & D = \{1,2,3,5,6,7\} \\
w_{r_8} = 1 - w_{r_3} - w_{r_5} & D = \{1,2,3,5,6,7,8\} \\
I_4 = [0, 1 - w_{r_7}] & D = \{1,2,3,4,5,6,7,8\} \\
w_{r_9} = 1 - w_{r_4} - w_{r_7} & D = E
\end{array}$$

Using these intervals and equalities, the code which generates all retiming solutions
for the DFG in Figure 2.7 is given below. Note that xi is used to represent $w_{r_i}$.

for (x1 = 0; x1 <= 1; x1++)
  for (x2 = 0; x2 <= 1 - x1; x2++)
  {
    x6 = 1 - x1 - x2;
    for (x3 = 0; x3 <= 1 - x2; x3++)
    {
      x7 = 1 - x2 - x3;
      for (x5 = 0; x5 <= 1 - x3; x5++)
      {
        x8 = 1 - x3 - x5;
        for (x4 = 0; x4 <= 1 - x7; x4++)
        {
          x9 = 1 - x4 - x7;
          if (x6 >= 0 && x7 >= 0 && x8 >= 0 && x9 >= 0)
            ; /* print the values of x1 through x9 and r1 through r6 */
        }
      }
    }
  }

There are twelve retiming solutions for the DFG. The retiming vector r is computed
from the retimed weight vector using (2.16) and r(1) = 0, where node 1 is the reference
node. The retimed edge weights and the retiming values for the nodes are listed in
Table 2.2.


If  a  DFG  is  not  strongly  connected,  it  is  possible  to  add  edges  to  the  DFG  to  make  it 
strongly  connected  so  all  retiming  solutions  can  be  generated.  Consider  the  biquad  filter 
in  Figure  2.8(a).  This  graph  is  not  strongly  connected  because,  for  example,  there  is  no 
path  from  the  output  node  to  the  input  node.  To  make  this  graph  strongly  connected, 
it  can  be  modified  by  adding  an  edge  from  the  output  node  to  the  input  node  as  shown 
in Figure 2.8(b). The modified graph has a new loop IN → OUT → IN which has one
delay.  This  loop  forces  the  latency  of  the  DFG  to  be  one  cycle.  Using  the  techniques 
presented  in  this  section,  we  find  that  there  are  224  retiming  solutions  for  the  DFG  in 
Figure  2.8(b). 

As  another  example,  consider  the  correlator  in  Figure  2.9  which  is  used  to  demonstrate 
retiming  in  [27].  Using  the  techniques  presented  in  this  section,  143  retiming  solutions 
can  be  found  for  this  DFG.  This  result  was  also  reported  in  [30]. 
Table 2.2: The twelve valid retiming solutions for the DFG in Figure 2.7.

sol'n #   w_r1  w_r2  w_r3  w_r4  w_r5  w_r6  w_r7  w_r8  w_r9
   1        0     0     0     0     0     1     1     1     0
   2        0     0     0     0     1     1     1     0     0
   3        0     0     1     0     0     1     0     0     1
   4        0     0     1     1     0     1     0     0     0
   5        0     1     0     0     0     0     0     1     1
   6        0     1     0     1     0     0     0     1     0
   7        0     1     0     0     1     0     0     0     1
   8        0     1     0     1     1     0     0     0     0
   9        1     0     0     0     0     0     1     1     0
  10        1     0     0     0     1     0     1     0     0
  11        1     0     1     0     0     0     0     0     1
  12        1     0     1     1     0     0     0     0     0


sol'n #    r1    r2    r3    r4    r5    r6
   1        0    -1     0    -1    -1     0
   2        0    -1     0     0    -1     0
   3        0    -1     0     0     0     0
   4        0    -1     0     0     0     1
   5        0     0     0     0     0     0
   6        0     0     0     0     0     1
   7        0     0     0     1     0     0
   8        0     0     0     1     0     1
   9        0     0     1     0     0     1
  10        0     0     1     1     0     1
  11        0     0     1     1     1     1
  12        0     0     1     1     1     2

Figure 2.8: (a) The biquad filter. This graph is not strongly connected. (b) A modified
version of the biquad filter. This graph is strongly connected.


Figure  2.9:  The  correlator  example  which  has  143  retiming  solutions. 


2.4.3  Bit-Serial  Scheduling 


Since the bit-serial scheduling formulation has the same form as the bit-parallel
scheduling formulation, the techniques used to generate all bit-parallel scheduling
solutions can be used to generate all bit-serial scheduling solutions by replacing f with b
and replacing $Nw - d_u$ with $Ww - (t_u - t_v)$.

The values of r and p can be computed from s (recall that $s = p + Wr$) using
$r = \lfloor s/W \rfloor$ and $p = s - Wr$. It can be shown that these expressions for r and p
result in


•  $0 \le p \le N-1$. This means that $p_i$ is indeed a time partition satisfying
$0 \le p_i \le N-1$.

•  $w_r \ge 0$ and $Bw = Bw_r$ if $t_u \ge t_v$ for all edges $U \to V$ as shown in Figure 2.6.
This means that r is a valid retiming solution of G when $t_u \ge t_v$ for all $e \in E$.

Example 2.6 In this example, we generate all possible schedules for the bit-serial
implementation of the third-order all-pole filter shown in Figure 2.10 assuming two's
complement number representation, a data wordlength of 8 (i.e., W = 8), and a coefficient
wordlength of 4.


Figure  2.10:  A  third-order  all-pole  IIR  filter. 

The  first  step  is  to  determine  the  timing  diagram  for  each  operator.  The  circuit  and 
timing  diagram  for  an  adder  are  given  in  Figure  2.5.  The  circuits  and  timing  diagrams 
for  multiplication  by  —1/4,  1/8,  and  1/2  are  given  in  parts  (a),  (b),  and  (c),  respectively, 
of  Figure  2.11.  Using  these  sub-circuits,  the  timing  diagram  for  the  filter  is  shown  in 
Figure 2.12.

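The fundamental loop matrix is

$$B = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 & 1
\end{bmatrix}.$$

In addition, we have

$$w = [\,1\ \ 2\ \ 3\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\,]^T, \qquad
t_u = [\,1\ \ 1\ \ 1\ \ 4\ \ 3\ \ 1\ \ 1\ \ 1\,]^T,$$

and $t_v = 0$. The equation $B(Ww - (t_u - t_v)) = Bb$ is

$$\begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 & 1
\end{bmatrix} b =
\begin{bmatrix} 2 \\ 10 \\ 20 \end{bmatrix}.$$

Figure 2.11: The circuits and timing diagrams for the three multipliers in Figure 2.10.
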

The intervals and equalities are

$$\begin{array}{l}
I_1 = [0,\ 2] \\
I_4 = [0,\ 2 - b_1] \\
b_6 = 2 - b_1 - b_4 \\
I_2 = [0,\ 10 - b_6] \\
I_5 = [0,\ 10 - b_2 - b_6] \\
b_7 = 10 - b_2 - b_5 - b_6 \\
I_3 = [0,\ 20 - b_6 - b_7] \\
b_8 = 20 - b_3 - b_6 - b_7
\end{array}$$


There are 6103 valid scheduling solutions. To avoid examining all of these solutions,
let us examine only those solutions which use the minimum number of serial registers.
The number of serial registers is

$$D = \max(b_1, b_2, b_3) + b_4 + b_5 + b_6 + b_7 + b_8.$$

Figure 2.12: The timing diagram for the filter in Figure 2.10. The edge labels are shown
in parentheses to avoid confusion with the timing values.

The minimum number of registers for all 6103 valid scheduling solutions is $D_{min} = 20$,
and there are 330 solutions which use 20 registers. One solution that uses 20 registers is

$$\begin{array}{l}
b = [\,0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 2\ \ 8\ \ 10\,]^T \\
s = [\,0\ \ {-3}\ \ {-12}\ \ {-7}\ \ {-15}\ \ {-23}\,]^T \\
r = [\,0\ \ {-1}\ \ {-2}\ \ {-1}\ \ {-2}\ \ {-3}\,]^T \\
p = [\,0\ \ 5\ \ 4\ \ 1\ \ 1\ \ 1\,]^T
\end{array}$$

The  complete  architecture  for  this  solution  is  shown  in  Figure  2.13.  This  architecture 
uses  20  registers,  not  including  the  registers  which  are  internal  to  the  processing  units. 
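
The totals quoted in this example can be checked by direct enumeration. The sketch below is illustrative only. It assumes the intervals and equalities are I1 = [0, 2], I4 = [0, 2 - b1], b6 = 2 - b1 - b4, I2 = [0, 10 - b6], I5 = [0, 10 - b2 - b6], b7 = 10 - b2 - b5 - b6, I3 = [0, 20 - b6 - b7], and b8 = 20 - b3 - b6 - b7, and that the register count is D = max(b1, b2, b3) + b4 + b5 + b6 + b7 + b8. The function names are hypothetical.

```c
#include <assert.h>

/* Largest of three integers. */
int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Enumerate every (b1, ..., b8) allowed by the intervals and equalities
   listed in the lead-in. Returns the number of solutions and stores the
   smallest register count D that was seen in *min_regs. */
long enumerate_bit_serial(int *min_regs)
{
    assert(min_regs != 0);
    long count = 0;
    int best = -1;

    for (int b1 = 0; b1 <= 2; b1++)
        for (int b4 = 0; b4 <= 2 - b1; b4++) {
            int b6 = 2 - b1 - b4;                  /* loop 1 equality */
            for (int b2 = 0; b2 <= 10 - b6; b2++)
                for (int b5 = 0; b5 <= 10 - b2 - b6; b5++) {
                    int b7 = 10 - b2 - b5 - b6;    /* loop 2 equality */
                    for (int b3 = 0; b3 <= 20 - b6 - b7; b3++) {
                        int b8 = 20 - b3 - b6 - b7; /* loop 3 equality */
                        int d = max3(b1, b2, b3) + b4 + b5 + b6 + b7 + b8;
                        if (best < 0 || d < best)
                            best = d;
                        count++;
                    }
                }
        }

    *min_regs = best;
    return count;
}
```

Under these assumptions the function returns 6103 and stores 20 in `*min_regs`, in agreement with the total number of solutions and the minimum register count stated in the example.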

2.5  Bit-Parallel  Scheduling  with  Resource  Constraints 

The set of all schedules generated for a DFG may include many schedules which require
more hardware resources than are available for the implementation. In this section, we
describe two methods for finding the schedules which satisfy a given set of resource
constraints. In the first method (the solution-save method), we generate all scheduling
solutions and then save only the solutions which satisfy the resource constraints. In the
second method (the solution-generate method), we only generate those scheduling solutions
which satisfy the resource constraints.

Figure 2.13: An architecture for the third-order all-pole filter. This architecture uses the
minimum number of registers (20), not including the registers which are internal to the
processing units.

2.5.1  The  Solution-Save  Method 

The number of hardware modules required by a scheduled DFG can be determined
from p. For example, let $m_n$ be the number of multiplication operations scheduled to
time partition n ($0 \le n \le N-1$), and let $a_n$ be the number of addition operations
scheduled to time partition n. Then the number of multipliers required by the schedule
is $m = \max_{0 \le n \le N-1}\{m_n\}$ and the number of adders is $a = \max_{0 \le n \le N-1}\{a_n\}$.
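
The resource count above can be sketched in C as follows. This is an illustrative helper rather than code from the thesis: the name `units_required` and the `is_type` flag array (nonzero for nodes of the operation type being counted) are hypothetical.

```c
#include <assert.h>

/* Number of hardware units of one operation type needed by a schedule:
   the largest number of operations of that type which share a time
   partition. p[i] is the time partition of node i (0 <= p[i] <= N - 1);
   is_type[i] is nonzero when node i is of the type being counted. */
int units_required(const int *p, const int *is_type, int count, int N)
{
    assert(N > 0);
    int need = 0;
    for (int n = 0; n < N; n++) {
        int used = 0;
        for (int i = 0; i < count; i++)
            if (is_type[i] && p[i] == n)
                used++;
        if (used > need)
            need = used;
    }
    return need;
}
```

With the time partitions of solution 4 in Table 2.4, p = (0, 3, 1, 2, 3, 0, 2, 1), and N = 4, the function returns 1 for the adders (nodes 1, 2, 7, and 8) and 1 for the multipliers (nodes 3, 4, 5, and 6).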
Example 2.7 In this example we find all scheduling solutions which require 1 multiplier
and 1 adder for the biquad filter in Figure 2.8(b) assuming an iteration period of N = 4
and assuming that addition and multiplication require 1 and 2 units of time, respectively.
Nodes 1, 2, 7, and 8 are addition operations and nodes 3, 4, 5, and 6 are multiplication
operations.

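The fundamental loop matrix is

$$B = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1
\end{bmatrix}$$

and $B(4w - d_u) = [\,2\ \ 0\ \ 3\ \ 4\ \ 7\,]^T$. The intervals and equalities are

$$\begin{array}{l}
I_1 = [0,\ 2] \\
f_8 = 2 - f_1 \\
I_2 = [0,\ 0] \;\Rightarrow\; f_2 = 0 \\
I_3 = [0,\ 0 - f_2] \;\Rightarrow\; f_3 = 0 \\
f_9 = 0 - f_2 - f_3 \;\Rightarrow\; f_9 = 0 \\
I_6 = [0,\ 3 - f_8] \\
I_7 = [0,\ 3 - f_6 - f_8] \\
f_{10} = 3 - f_6 - f_7 - f_8 \\
I_4 = [0,\ 4 - f_9] \\
f_{11} = 4 - f_4 - f_9 \\
I_5 = [0,\ 7 - f_8 - f_{10}] \\
f_{12} = 7 - f_5 - f_8 - f_{10}
\end{array}$$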

There  is  a  total  of  625  valid  scheduling  solutions  for  this  example;  however,  only  6  of 
these  solutions  use  only  1  adder  and  1  multiplier.  Tables  2.3  and  2.4  give  the  details  of 
these solutions, and the DFGs for these six solutions are given in Figure 2.14.


Table 2.3: The f and s values for the six valid scheduling solutions for the biquad filter
which use 1 adder and 1 multiplier for an iteration period of 4.

sol'n #    f     s1    s2    s3    s4    s5    s6    s7    s8
   1       …     …     …     …     …     …     …     …     …
   2       …     …     …     …     …     …     …     …     …
   3       …     0    -1    -3    -4    -2    -1     2     1
   4       …     0    -1    -3    -6    -1    -4     2     1
   5       …     0    -1    -3    -4    -1    -6     2     1
   6       …     0    -1    -3    -4    -1    -2     2     1

Table 2.4: The r and p values for the six valid scheduling solutions for the biquad filter
which use 1 adder and 1 multiplier for an iteration period of 4.

sol'n #   r1   r2   r3   r4   r5   r6   r7   r8     p1   p2   p3   p4   p5   p6   p7   p8
   1       0   -1   -1    …    …    …    0    0      0    3    1    3    2    0    2    1
   2       0   -1   -1    …    …    …    0    0      0    3    1    0    2    3    2    1
   3       0   -1   -1   -1   -1   -1    0    0      0    3    1    0    2    3    2    1
   4       0   -1   -1   -2   -1   -1    0    0      0    3    1    2    3    0    2    1
   5       0   -1   -1   -1   -1   -2    0    0      0    3    1    0    3    2    2    1
   6       0   -1   -1   -1   -1   -1    0    0      0    3    1    0    3    2    2    1

Figure 2.14: The six scheduling solutions for the biquad filter which use 1 adder and 1
multiplier. The number in parentheses next to a node is the time partition to which the
node is scheduled.

Example 2.8 Consider the 4-stage pipelined 8-th order all-pole lattice filter in
Figure 2.15. Edge 11 has been added to this filter to make it strongly connected. For
the iteration period N = 2, this filter has 450 scheduling solutions, and 99 of these
schedules use 2 adders and 2 multipliers. Of these 99 schedules, the minimum possible
number of registers required for the implementation is 10, and only 2 of these 99
schedules use 10 registers. These schedules are
$s = [\,0\ \ 3\ \ 1\ \ {-2}\ \ 1\ \ 4\ \ 2\ \ {-1}\,]^T$ and
$s = [\,0\ \ 3\ \ 1\ \ {-2}\ \ 2\ \ 5\ \ 3\ \ 0\,]^T$. The minimum number of registers is computed
using the techniques in [46] with the modification that the results reported here assume
that for a processor that is pipelined by $P_u$ stages, the $P_u$ pipelining registers cannot be
used by output samples from other processors, while the results in [46] allow one
pipelining register to be shared by other processors. For the iteration period N = 4, the
filter in Figure 2.15 has 910910 scheduling solutions, and 10083 of these schedules use 1
adder and 1 multiplier. Of these 10083 schedules, the minimum possible number of registers
required for the implementation is 11, and 21 of these 10083 solutions use 11 registers.

2.5.2  The  Solution-Generate  Method

This section describes a technique for exhaustively generating only the bit-parallel
schedules which can be implemented on a given set of hardware resources. Using this
technique, we can avoid generating those schedules which use more resources than are
available, and this allows us to generate the desirable schedules in considerably less time.
The following theorem is needed so we can construct B in a manner that allows us to
perform exhaustive bit-parallel scheduling with resource constraints.

Figure 2.15: The 4-stage pipelined 8-th order all-pole lattice filter. The edge labels are
in parentheses to avoid confusion with the node labels. One possible spanning tree is
shown in solid lines.

Theorem  2.7  In  Algorithm  FFL,  let  vj  be  the  node  that  the  link  1^  is  incident  from.  If 
vj  is  in  then  there  are  no  branches  in  loop{k)  which  are  also  in  {G—G^l^^).  Ifvj  is  in 
{G  —  G^^^),  then  there  are  branches  in  loop{k)  which  are  in  {G  —  G^l^^),  and  these  branches 
form an elementary directed path which we shall denote as $v_0 \to v_1 \to \cdots \to v_{J-1} \to v_J$.

Proof:  The  loop  denoted  as  loop(A;)  in  Algorithm  FFL  has  the  form  of  Figure  2.16(a) 
or  2.16(b),  where  is  the  root  node  of  the  spanning  tree  and  u/yv  is  a  node  in 
Recall  from  Theorem  2.5  that  the  form  in  Figure  2.16(b)  results  from  vj  ^  vijsf 
^COMMON  vii'^  ^COMMON  vj.  Both  forms  of  loop(A:)  can  be  generalized  as  the 
loop  in  Figure  2.16(c),  where  V[n,  Vy,  and  pB  are  in  G^^ .  The  proof  has  two  cases, 
which  take  into  account  whether  or  not  node  uj  is  in  G^^ . 

Case  I:  vj  is  in  G^^ .  If  the  path  pA  in  Figure  2.16(c)  has  any  edges  in  (G  -  G^^),  then 
a  subpath  V2  of  pA  must  exist  in  (G  —  G^^),  where  V2  is  in  G^\  The  last  edge  in 
vi  U2,  i.e.,  the  edge  that  is  incident  into  V2,  cannot  be  a  link  because  Zjt  is  the  only 


Figure  2.16:  (a)  One  form  of  loop(A:):  Link  Zjt  is  in  (G  —  and 

path  p2  is  in  .  (b)  The  other  form  of  loop(A:):  vj  ^  vjn  vcommon  vj.  Link 

Ik  is  in  (G  —  and  path  p4  is  in  G^^^  (c)  Equivalent  [oop{k):  vj  ^  v/t^j  vy  vj. 

Link  Ik  is  in  (G— G^^^)  and  path  ps  is  in  g\^K  The  forms  in  (a)  and  (b)  can  be  generalized 
to  the  form  in  (c). 

link  which  is  in  loop(A:)  and  in  (G  -  G^^^)  (recall  that  loop(A;)  is  in  G^^^  U  Gr  U  Ik).  The 
last  edge  in  ui  V2  cannot  be  a  branch  because  Lemma  2.3  says  there  is  no  branch  in 
(G  —  Gyi^^)  which  is  incident  into  a  node  in  G^^  Therefore,  if  uj  is  in  G^ll\  pa  can  have 
no  edges  in  (G  -  and  there  are  no  branches  that  are  in  loop(A:)  and  in  (G  —  G^^). 

Case  II:  vj  is  in  (G  -  G^^).  The  edge  incident  into  vj  in  loop{A:)  is  in  (G  -  G^^)  (if 
not,  vj  would  be  in  G^'),  and  this  edge  is  a  branch  because  Ik  is  the  only  link  which  is 
in  loop(A:)  and  in  (G  -  We  denote  the  branch  in  loop(A:)  which  is  incident  into  vj 

as  vj-i  vj.  Similarly,  if  vj-\  is  in  (G  —  G^^),  then  branch  bj-i  exists  in  (G  —  G^^^) 
to  form  the  path  vj^2  vj-i  H  vj.  On  the  other  hand,  if  vj-i  is  in  G^\  then 
by  using  Case  I  of  this  proof,  we  know  that  the  path  vy  vj-i  can  have  no  edges 
in  (G  —  Continuing  this  argument,  we  see  that  when  uj  is  in  (G  —  G^^),  there 

are  branches  which  are  in  loop(A:)  and  in  (G  —  G^^),  and  these  branches  form  the  path 

$v_0 \xrightarrow{b_1} v_1 \xrightarrow{b_2} \cdots \xrightarrow{b_{J-1}} v_{J-1} \xrightarrow{b_J} v_J$.  □
As described in Section 2.2.3, we construct the fundamental loop matrix B by letting
loop(k) from Algorithm FFL be the k-th row of B. The edges in the graph are numbered
such that the first (|V| - 1) columns of B correspond to the branches of the spanning
tree of G, and the remaining (|E| - |V| + 1) columns correspond to the links. From
Theorem 2.7 we know that if loop(k) contains branches which have not appeared in previous
loops, then these branches form the elementary directed path
$v_0 \to v_1 \to \cdots \to v_{J-1} \to v_J$. These branches are assigned to the next available
columns of B in the order that they appear in this path. The link $l_k$ is assigned to the
(|V| - 1 + k)-th column of B. By constructing the fundamental loop matrix in this manner,
it still has the form given in (2.3); however, it now allows us to use Algorithm IE to
determine the schedule values of the nodes directly.

The interval $I_n$ for the scheduling problem is found by enforcing (2.22) for all k
such that $b_{kn} = 1$. Assume that the edge n is incident into node $V_n$ and incident from
node $U_n$, i.e., $U_n \xrightarrow{n} V_n$. From (2.7), the expression for the n-th folded edge
weight is $f_n = N w_n - d_{u_n} + s_{v_n} - s_{u_n}$. Substituting this into the interval for
$f_n$ gives

$$0 \le N w_n - d_{u_n} + s_{v_n} - s_{u_n} \le b_k^T (N w - d_u) - \sum_{j \in D} b_{kj} f_j$$

for all k such that $b_{kn} = 1$. Solving for $s_{v_n}$ gives

$$-N w_n + d_{u_n} + s_{u_n} \le s_{v_n} \le -N w_n + d_{u_n} + s_{u_n} + b_k^T (N w - d_u) - \sum_{j \in D} b_{kj} f_j$$

for all k such that $b_{kn} = 1$.

To avoid confusion with the interval for $f_n$ (recall that we denoted this as $I_n$), the
interval for $s_{v_n}$ is denoted as $I_n^s$. This notation specifies that $I_n^s$ is an interval
for the scheduling value of the node that edge n is incident into. Let
$\alpha_n = -N w_n + d_{u_n} + s_{u_n}$. Then the interval $I_n^s$ is simply the interval $I_n$
from Algorithm IE with $\alpha_n$ added to the lower and upper bounds. We shall denote this as
$I_n^s = I_n + \alpha_n$.

Using the technique described in this section for constructing the fundamental loop
matrix B, Algorithm IE can be used to determine the intervals for the folded edge
weights, and the intervals for the scheduling values for the nodes can be found using
$I_n^s = I_n + \alpha_n$.
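
The shift from a folded-edge-weight interval to an interval on a node's schedule value can be sketched in C. The helper names and the `struct interval` type below are hypothetical. The sketch assumes the offset is alpha_n = -N*w_n + d_u + s_u, where w_n is the number of delays on edge n, d_u is the computation time of the node the edge leaves, and s_u is that node's schedule value.

```c
#include <assert.h>

/* Closed integer interval [lo, hi]. */
struct interval {
    int lo;
    int hi;
};

/* Offset alpha_n = -N * w_n + d_u + s_u for edge n. */
int alpha(int N, int w_n, int d_u, int s_u)
{
    assert(N > 0);
    return -N * w_n + d_u + s_u;
}

/* Interval for the schedule value of the node that edge n enters:
   the folded-edge-weight interval with alpha_n added to both bounds. */
struct interval schedule_interval(struct interval folded, int a)
{
    assert(folded.lo <= folded.hi);
    struct interval out = { folded.lo + a, folded.hi + a };
    return out;
}
```

For the first row of Table 2.5 (N = 4, no delay on the edge, unit computation time, and a reference schedule value of 0), the folded interval [0, 1] becomes [1, 2].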


Example 2.9 In this example, all possible scheduling solutions are generated for the
DFG in Figure 2.17 for an iteration period of 4 by generating the solutions for s directly.
The computation time for each node is assumed to be unity. Using the technique described
in this section for constructing B results in

$$B = \begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1
\end{bmatrix}. \qquad (2.24)$$

Notice that the edge labels in Figure 2.17 are different than those used in Figure 2.7. The
labels have been changed so the column numbers of B in (2.24) correspond to the edge
labels in Figure 2.17. Using $B(Nw - d_u) = [\,1\ \ 1\ \ 1\ \ 1\,]^T$, the intervals are given in
Table 2.5. Note that in this table $f_n = N w_n - d_{u_n} + s_{v_n} - s_{u_n}$ has been used to
simplify the upper bounds of the intervals.


Figure  2.17:  The  graph  scheduled  in  Example  2.9. 




Table 2.5: The intervals for Example 2.9.

 n    I_n            alpha_n     I_n^s
 1    [0, 1]         1 + s1      [1, 2]
 2    [0, 1 - f1]    -3 + s3     [-3 + s3, -1]
 3    [0, 1 - f2]    1 + s2      [1 + s2, -1 + s3]
 4    [0, 1 - f3]    1 + s5      [1 + s5, 3 + s2]
 5    [0, 1 - f7]    1 + s3      [1 + s3, 3 + s5]

The code for this example is

for (s3 = 1; s3 <= 2; s3++)
  for (s2 = -3 + s3; s2 <= -1; s2++)
    for (s5 = 1 + s2; s5 <= -1 + s3; s5++)
      for (s4 = 1 + s5; s4 <= 3 + s2; s4++)
        for (s6 = 1 + s3; s6 <= 3 + s5; s6++)
        {
          /* Compute link weights. If all are nonnegative, print s1 through s6. */
        }


The twelve solutions for s generated from this code are the same as those listed in
Table 2.1.

By determining the values of the schedule vector directly rather than first determining
the folding vector and then computing the schedule vector, we can generate only those
schedules which can be executed using a limited number of hardware modules. This is
done using a programming technique that avoids the solutions which use more resources
than are available. For each operation type (e.g., addition or multiplication), an array
of N data elements is used such that there is one element for each time partition from
0 to N - 1. Each data element contains the number of operations of a given type that
are currently scheduled to that time partition. Each data element also keeps track of
the next time partition in which the hardware resources for that particular operation
type are not fully utilized. By keeping track of this information, when we generate a
new schedule by incrementing the schedule value for a node, the node is scheduled to a
time partition in which the hardware resources for the operation are not already fully
utilized. The end result is that we do not generate the schedules that use more resources
than are available, so we can generate all scheduling solutions for a given set of resource
constraints much more quickly than if we find all possible schedules and keep only those
schedules which satisfy the resource constraints.
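
The bookkeeping described above might look roughly like the following C sketch. It is not the implementation used for the reported results: the type and function names, and the bound `MAX_N`, are hypothetical. As a simplification, it scans forward for the next time partition with spare capacity instead of caching that partition in each array element.

```c
#include <assert.h>

#define MAX_N 64  /* hypothetical upper bound on the iteration period */

/* Usage counters for one operation type (e.g., the adders). */
struct usage {
    int N;             /* iteration period = number of time partitions */
    int capacity;      /* hardware units available for this type */
    int count[MAX_N];  /* operations currently scheduled per partition */
};

/* Time partition of schedule value s, always in 0 .. N - 1. */
int partition_of(int s, int N)
{
    assert(N > 0);
    int p = s % N;
    return p < 0 ? p + N : p;
}

/* Smallest schedule value t in [s, s_max] whose time partition still has
   a free unit, or s_max + 1 when there is none. */
int next_free(const struct usage *u, int s, int s_max)
{
    for (int t = s; t <= s_max; t++)
        if (u->count[partition_of(t, u->N)] < u->capacity)
            return t;
    return s_max + 1;
}

/* Claim a unit when a node is given schedule value s. */
void reserve_slot(struct usage *u, int s)
{
    u->count[partition_of(s, u->N)]++;
}

/* Give the unit back when the node's schedule value is about to change. */
void release_slot(struct usage *u, int s)
{
    u->count[partition_of(s, u->N)]--;
}
```

In an enumeration loop, the increment of a node's schedule value would be replaced by a call to `next_free`, so that values whose time partitions are already full are skipped rather than generated and rejected.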

The advantages of including the resource constraints are demonstrated using the
fifth-order wave digital elliptic filter shown in Figure 2.18. We assume that addition
and multiplication require 1 and 2 units of time, respectively, and that hardware adders
and multipliers are pipelined by 1 and 2 stages, respectively. The results of exhaustively
generating the scheduling solutions without considering resource constraints are shown
in Table 2.6. The results of exhaustively generating the scheduling solutions which can
be implemented on a given number of hardware adders and multipliers are shown on
the left side of Table 2.7. From these tables, we can see that exhaustively generating
only the scheduling solutions which satisfy a given set of resource constraints takes
orders of magnitude less time than exhaustively generating all scheduling solutions. The
expressions in [46] can be used to compute the number of registers required by a given
schedule. The results of this are shown on the right side of Table 2.7. Note that these
results assume that internal pipelining registers cannot be shared between processors,
while the results in [46] assume that internal pipelining registers can be shared between
processors.

Figure 2.18: The fifth-order wave digital elliptic filter. The branches of the spanning tree
used in Algorithm FFL are shown with solid lines, and the links are shown with dotted
lines.

Table 2.6: The results of exhaustively scheduling the filter in Figure 2.18 using the
techniques presented in Section 2.4.1.

iter period    # sched solutions    CPU time (sec)
    16                  9900             0.0342
    17               4669095            16.2
    18             580432280          2020

Table 2.7: The results of exhaustively scheduling the filter in Figure 2.18 for a given set
of resource constraints using the techniques presented in Section 2.5.2. The left part of
the table considers scheduling to the minimum possible number of adders and multipliers
for the given iteration period, and the right part considers scheduling to the minimum …

iter      resources      # solns     CPU time    resources           # solns
period    (add, mult)                (sec)       (add, mult, reg)
  16      (3, 1)               77     0.00288    (3, 1, 7)                21
  17      (2, 1)               98     0.0518     (2, 1, 7)                73
  18      (2, 1)           131983    11.1        (2, 1, 7)             40723
  19      (2, 1)         33948842  1700          (2, 1, 7)           3056246

2.6  Conclusions 

Formulations have been presented in this chapter for the bit-parallel and bit-serial
scheduling problems, and we have shown that the retiming formulation introduced in
[30] is a special case of our bit-parallel scheduling formulation. Techniques have been
developed and demonstrated for exhaustively generating all unique retiming and scheduling
solutions for a strongly connected DFG. These techniques allow a circuit designer to
explore the space of possible implementations.

In addition to the technique for exhaustively generating all unique bit-parallel
scheduling solutions, a technique was also developed for exhaustively generating only the
bit-parallel scheduling solutions which satisfy a given set of resource constraints. Our
results indicate that this technique can generate schedules in CPU times that are more
than two orders of magnitude shorter than those needed to generate all solutions.

One  advantage  of  the  formulations  presented  in  this  chapter  is  that  they  allow  us  to 
understand  how  retiming  and  scheduling  are  similar  and  that  retiming  is  an  important 
part  of  scheduling.  Specifically,  we  show  that  retiming  is  a  special  case  of  scheduling, 
and  we  include  retiming  in  our  scheduling  formulations  to  make  them  general  and  to 
make  visible  the  role  of  retiming  during  scheduling. 

The numbers reported in Tables 2.6 and 2.7 show some scheduling results for the
fifth-order wave digital elliptic filter. Since this filter is often used to demonstrate
scheduling techniques, the numbers in these tables provide some benchmarks for gauging the
effectiveness of scheduling algorithms. These numbers indicate that the number of schedules
increases dramatically as the difference between the iteration period and the iteration
bound becomes larger. Therefore, for practical applications, our exhaustive scheduling
techniques are most useful when the iteration period is at or near the iteration bound.
Chapter  3

Register  Minimization  in  Folded 
Architectures 

3.1  Introduction 

In this chapter, expressions are derived for the minimum number of registers required
to implement a statically scheduled DFG. Two cases are considered, namely, the cases
where retiming is and is not allowed to be performed on the scheduled DFG.

We begin with a motivating example. After the DFG has been scheduled, specifications
for the communication paths between hardware modules can be determined using
systematic folding techniques [28]. Consider the multiply-add operation in Figure 3.1(a),
which is an algorithm DFG describing y(n) = a u(n) + v(n). Assume this multiply-add
is part of a larger DFG which is to be implemented in hardware with an iteration period
of 10, i.e., each node in the algorithm DFG will be executed by the hardware exactly
once every 10 time units. If the multiply operation is executed by one-stage pipelined
hardware module $H_M$ at time units 10l + 2, and the add operation is executed by hardware
module $H_A$ at time units 10l + 8 for integer iterations l, then the connection between the
multiplication and addition operations in Figure 3.1(a) is mapped to the data path in
Figure 3.1(b) (details of how this data path specification is derived are provided in
Section 3.2.2). Upon examination of Figure 3.1(b), one observes that at any given time,
no more than one of the five delays labeled "5D" between $H_M$ and $H_A$ is storing a
word of data that will actually be consumed by $H_A$. To avoid the inefficient architecture
that would result from direct implementation of Figure 3.1(b) in silicon, memory
management is used in high-level synthesis tools to derive efficient data paths between
processing modules.
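
The five delays between the multiplier and the adder are consistent with the folded edge weight expression of (2.7), f = N*w - d_u + s_v - s_u, if d_u is read as the single pipeline stage of the multiplier module. That reading is an assumption made here, since the derivation itself is deferred to Section 3.2.2; the function name below is hypothetical.

```c
#include <assert.h>

/* Folded edge weight for an edge U -> V:
     N   : iteration period
     w   : number of delays on the edge in the algorithm DFG
     d_u : pipeline stages (computation time) of the module executing U
     s_u : time at which U is executed within the iteration
     s_v : time at which V is executed within the iteration */
int folded_delays(int N, int w, int d_u, int s_u, int s_v)
{
    assert(N > 0);
    return N * w - d_u + s_v - s_u;
}
```

For the multiply-add example, `folded_delays(10, 0, 1, 2, 8)` evaluates to 10*0 - 1 + 8 - 2 = 5, matching the "5D" label.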


Figure 3.1: (a) Algorithm DFG describing y(n) = a u(n) + v(n). (b) Data path
specification derived from the algorithm DFG for an iteration period of 10.


Memory  management  consists  of  choosing  the  type  of  registers,  number  of  registers, 
and  allocation  of  data  to  these  registers.  The  type  of  registers  is  usually  dictated  by 
the  architecture  model  used.  Throughout  this  chapter,  the  term  “register”  is  used  to 
describe  a  storage  location  capable  of  storing  one  word  of  data.  We  use  the  term  “memory 
model”  for  a  general  rule  which  describes  how  data  can  be  allocated  to  the  registers.  For 
example,  one  memory  model  might  force  each  functional  unit  in  the  architecture  to  store 
its  output  samples  in  a  set  of  registers  dedicated  to  only  that  functional  unit,  while 
another  memory  model  might  lift  this  restriction  and  allow  all  of  the  functional  units 
to  share  a  common  set  of  registers.  Naturally,  the  memory  model  affects  the  number 
of registers and the allocation of data to the registers. In this chapter, we compute
the minimum number of registers required for a statically scheduled DFG under various
memory  models.  The  allocation  of  the  data  to  registers  is  an  NP-complete  problem  for 
which  heuristic  algorithms  have  been  suggested  [51,  52,  53]. 

Techniques  for  computing  the  minimum  number  of  registers  required  by  a  statically 
scheduled  DFG  have  been  considered  in  the  past.  The  left-edge  algorithm  has  been 
used  to  find  the  minimum  number  of  registers  and  allocate  data  to  these  registers  [54]. 
The  life-time  chart  and  circular  life-time  graph  can  be  used  to  determine  the  minimum 
number  of  registers  in  any  DSP  circuit  [29].  The  circular  life-time  graph  is  particularly 
useful  because  it  graphically  takes  into  account  the  repetitive  and  periodic  nature  of  DSP 
operations.  These  graphs  have  been  used,  for  example,  to  determine  the  size  of  register 
files  in  DSP  architectures  [52]. 

In  this  chapter,  we  use  life-time  analysis  to  derive  closed-form  expressions  for  the 
minimum  number  of  registers  required  by  a  statically  scheduled  DSP  program.  These 
techniques offer several advantages over previously used techniques. First, the closed-form
expressions can be used to represent cost functions for high-level synthesis optimization
tools. An example of using these closed-form expressions in an integer linear
programming (ILP) formulation is given in Section 3.4. Second, the analytical tools we
introduce can be used to derive expressions for the minimum number of registers under
a variety of memory models which describe how data can be allocated to memory.
This is important because the target architecture may impose constraints on how data
can be routed to memory. We derive expressions for three memory models, namely the
operation-constrained, processor-constrained, and unconstrained memory models. For
the  unconstrained  memory  model,  where  all  memory-sharing  constraints  are  relaxed, 
the  minimum  number  of  registers  required  to  implement  a  DFG  with  m  nodes  can  be 
computed in $O(m^2)$ time. A third advantage of the analytical tools we introduce is
that they can be used to determine memory requirements for more complex algorithm
descriptions,  such  as  DFGs  which  have  multiplexers  in  the  data  paths. 

Pipelining  and  retiming  [27]  are  powerful  tools  used  in  high-level  synthesis.  Pipelining 
can be considered to be a special case of retiming. We consider an integer linear programming
solution to the retiming problem, referred to as the minimum physical storage
location (MPSL) retiming, which retimes a scheduled DFG such that its memory requirements
are minimized under the unconstrained memory model while the schedule
remains  valid  for  the  retimed  DFG.  We  use  MPSL  retiming  to  retime  a  DFG  which 
has  been  scheduled  using  the  MARS  design  system  [26],  and  we  compare  the  memory 
requirements  of  MARS  to  a  globally  optimal  solution.  Our  results  show  that  the  MARS 
system  gives  optimal  or  close-to-optimal  results  in  terms  of  memory  requirements. 

The results we present can be used throughout the high-level synthesis process. Expressions
for the minimum number of registers can be used during scheduling to help
determine  the  total  cost  of  the  architecture.  After  scheduling,  MPSL  retiming  can  be 
used  to  optimally  retime  a  DFG  in  terms  of  registers  required  for  its  implementation. 
During memory management, our techniques can be used to optimize the hardware design
in terms of the number of registers required. For instance, given the scheduled DFG
and the desired memory model, the minimum number of registers required can be determined,
and register allocation can be performed by an appropriate register allocation
scheme which guarantees completion (e.g., forward-backward register allocation [51]).
Expressions for the minimum number of registers can also be used to evaluate the effectiveness
of register allocation schemes which are based on heuristics, since some schemes
may  require  more  memory  than  the  theoretical  lower  bound  in  order  to  maintain  simple 
control  structures. 

This chapter is organized as follows. The algorithm DFG model and the pipelined processor
model used in the chapter are described in Section 3.2. This section also describes
the systematic folding techniques which are used as a framework for our derivations.
Expressions are derived in Section 3.3 to compute the minimum number of registers required
to implement a statically scheduled DFG for various memory-sharing models. In
Section  3.4,  memory  minimization  is  considered  simultaneously  with  retiming,  and  our 
conclusions  are  presented  in  Section  3.5. 

3.2  Preliminaries 

The  DFG  model  we  consider  represents  periodic  and  nonterminating  data-flow  programs. 
We  consider  homogeneous  (single-rate)  DFGs,  where  each  node  is  executed  once  per 
iteration; however, the techniques used in this chapter can also be applied to multirate
DFGs  since  any  well-behaved  multirate  DFG  can  be  transformed  into  an  equivalent 
single-rate  DFG  [55],  [56].  Memory  requirements  for  multirate  DSP  program  descriptions 
have  also  been  considered  [57],  [58].  In  each  iteration  of  the  homogeneous  DFGs  we 
consider,  a  node  consumes  exactly  one  sample  from  each  arc  that  is  input  to  the  node 
and  produces  exactly  one  sample  which  is  available  at  the  output  of  the  node.  Each 
occurrence  of  a  data  path  connecting  the  output  of  a  node  to  an  input  of  a  node  is 
called  an  arc.  Figure  3.2(a)  shows  one  representation  of  a  DFG  which  contains  four  arcs, 
namely arc $U \to V_1$ with 0 delays, arc $U \to V_1$ with 4 delays, arc $U \to V_2$ with 2 delays,
and arc $U \to U$ with 1 delay. Figure 3.2(b) shows another representation of the same
DFG.  In  this  chapter,  the  DFG  simply  provides  a  program  description.  As  a  result,  the 
two  representations  in  Figures  3.2(a)  and  (b)  can  be  considered  equivalent  since  they 
describe  the  same  DSP  program. 
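
Because the DFG serves only as a program description here, a plain list of arcs is enough to capture it. The following is a minimal sketch in Python, not part of the original derivation; the `Arc` record and the helper names are illustrative assumptions, and the node labels follow the four arcs listed for Figure 3.2(a).

```python
from collections import namedtuple

# One arc of a homogeneous (single-rate) DFG: source node, sink node, delay count.
# The record layout is an assumption made for illustration.
Arc = namedtuple("Arc", ["src", "dst", "delays"])

# The four arcs listed in the text for Figure 3.2(a).
ARCS = [
    Arc("U", "V1", 0),
    Arc("U", "V1", 4),
    Arc("U", "V2", 2),
    Arc("U", "U", 1),
]


def output_arcs(arcs, node):
    """Return every arc whose source is `node` (the fanout of that node)."""
    return [arc for arc in arcs if arc.src == node]


def max_output_delay(arcs, node):
    """Largest delay count found on any arc leaving `node`."""
    return max(arc.delays for arc in output_arcs(arcs, node))
```

Two arcs between the same pair of nodes are kept as separate entries, which matches the statement that each occurrence of a data path is its own arc.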

The DFG is assumed to have no multiplexers and no conditional branches. When
computing the number of registers required to implement a DFG, G, it is assumed that
all arcs in G have both a source node and a sink node in G. Arcs which communicate
with the outside world can be included by introducing dummy nodes.

Figure 3.2: (a) A DFG with four arcs. (b) Equivalent representation of the DFG shown
in (a).

The  following  subsections  describe  the  pipelined  processor  model  used  in  this  chapter 
and  the  systematic  folding  techniques  which  form  a  framework  for  our  derivations. 

3.2.1  The  Pipelined  Processor  Model 

Consider a processor $H$ with $P$ pipelining stages and computational latency of $T$ units.
This pipelined processor is often represented as shown in Figure 3.3(a). The hardware
in the dashed box in Figure 3.3(a) is referred to as $H^{(P)}$. A more explicit representation
of $H^{(P)}$ is shown in Figure 3.3(b), where the computational latency of each sub-operator
$H_1, H_2, \ldots, H_P$ is assumed to be $T/P$. The dashed box shows that the $P$ delays
$D_1, D_2, \ldots, D_P$ are internal to $H^{(P)}$ and cannot be accessed by other data paths.

Consider the implementation of the pipelined processor $H$ shown in Figure 3.3(c).
The hardware in the dashed box in Figure 3.3(c) is referred to as $H^{(P')}$. In this case, the
$P' = P - 1$ delays $D_1, D_2, \ldots, D_{P-1}$ are internal to $H^{(P')}$, but the delay $D_P$ is external
to $H^{(P')}$ and can be accessed by other data paths. A simplified version of this model
is shown in Figure 3.3(d). The structure shown in Figure 3.3(d) may not be acceptable
for some applications due to the multiplexer delay, $T_{MUX}$. The final stage of pipelined
processor $H$ has a computational latency of $T_{H_P} + T_{MUX}$, where $T_{H_P}$ is the computational
latency of $H_P$. If $T_{H_P} + T_{MUX}$ is greater than the desired clock period, $T_{desired}$, then
the multiplexer must be eliminated and the delay $D_P$ can be dedicated to processor $H$
as in Figure 3.3(b). Throughout this chapter, we assume $T_{H_P} + T_{MUX} \le T_{desired}$ so
that the pipelined processor model $H^{(P')}$ can be used and the delay $D_P$ can be accessed
by outputs of other processors, as shown in Figure 3.3(d). We also assume $P \ge 1$ so
that $P'$ is nonnegative. When computing the minimum number of registers required
to implement a statically scheduled DFG, we do not count the $P'$ registers which are
internal to the processor.
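
The two conditions stated above (the clock-period check on the final stage and the count of internal delays) can be sketched as small Python helpers. This is an illustrative sketch only; the function and parameter names are assumptions, not notation from the chapter.

```python
def internal_delays(num_stages):
    """Number of pipelining delays kept private to the processor, P' = P - 1.

    The model assumes at least one pipelining stage so that P' is nonnegative.
    """
    if num_stages < 1:
        raise ValueError("the pipelined processor model assumes P >= 1")
    return num_stages - 1


def last_delay_shareable(t_last_stage, t_mux, t_desired):
    """True when the final stage plus the multiplexer fits in the clock period.

    If this returns False, the multiplexer has to be removed and the last
    delay stays dedicated to the processor, as in Figure 3.3(b).
    """
    return t_last_stage + t_mux <= t_desired
```

With a two-stage multiplier, for example, `internal_delays(2)` gives one internal register that is excluded from the register counts derived later.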


Figure 3.3: (a) Implementation of $P$-stage pipelined processor $H$ with lumped pipelining
delays. (b) Pipelined processor with separated internal pipelining delays. (c) Pipelined
processor where the last pipelining delay can be shared with other data paths. (d) A
simplified version of (c).

3.2.2 Systematic Folding Techniques


The folding transformation formalized in [28] gives a method of systematically determining
control circuit specifications from a statically scheduled DFG. This section presents
a  brief  introduction  to  these  systematic  folding  techniques. 

Consider the algorithm DFG in Figure 3.4(a) which contains the arc $U \to V$ with
$i$ delays. In this system, the result of the $l$-th iteration of operation $U$ is used for the
$(l + i)$-th iteration of operation $V$. Let $N$ be the folding factor, i.e., $N$ operations are
executed using a single hardware operator. Furthermore, let $u$ and $v$ be the folding
orders of $U$ and $V$, respectively. The folding order describes the time partition, or the
time unit modulo $N$, in which an operation is scheduled, i.e., the $l$-th iteration of $U$ is
scheduled to be executed by hardware operator $H_U$ at time unit $(Nl + u)$. Similarly, the
$(l + i)$-th iteration of $V$ is scheduled to be executed by hardware operator $H_V$ at time
unit $N(l + i) + v$. If $H_U$ has $P_U$ pipelining stages and the pipelined processor model $H^{(P')}$
(see Figure 3.3(d)) is used, then the result of the $l$-th iteration of $U$ is output from
$H_U^{(P'_U)}$ at $(Nl + u + P'_U)$, where $P'_U = P_U - 1$. The folding process maps each arc $U \to V$
with $i$ delays in the algorithm DFG to an arc in the architecture DFG. We denote by
$D_F(U \to V)$ the number of delays on the arc in the architecture DFG which is the result
of folding arc $U \to V$ in the algorithm DFG. This delay is the difference between the
execution time of the $(l + i)$-th iteration of $V$ and the time that the result of the $l$-th
iteration of $U$ is available, i.e.,

$$D_F(U \to V) = N(l + i) + v - (Nl + u + P'_U) = Ni - P'_U + v - u. \tag{3.1}$$

Note that the number of folded delays is iteration independent, i.e., $D_F(U \to V)$ is
independent of $l$. Hardware operator $H_U$, which is pipelined by $P_U$ stages and has $P'_U$
internal pipelining delays, is connected to hardware operator $H_V$ at switching instance
$(Nl + v)$ with $D_F(U \to V)$ delays, as shown in Figure 3.4(b). This derivation differs
slightly from the derivation in [28] since here we use the pipelined processor model $H^{(P')}$
(see Figure 3.3(d)), whereas the pipelined processor model $H^{(P)}$ (see Figure 3.3(a)) is used
in [28].

Figure 3.4: (a) An arc $U \to V$ in the algorithm DFG. (b) The mapping of the folded arc
in the architecture DFG.

A folding set is an ordered set of operations which are executed by the same processor.
Each folding set contains $N$ entries, some of which may be null operations. The operation
in the $j$-th position within the folding set (where $j$ goes from 0 to $N - 1$) is executed by the
processor during time partition $j$. For example, consider the folding set $S_1 = \{A_1, \emptyset, A_2\}$
for $N = 3$. Operation $A_1$ belongs to folding set $S_1$ with folding order 0 (also denoted as
$(S_1|0)$), and operation $A_2$ belongs to folding set $S_1$ with folding order 2 (also denoted
as $(S_1|2)$). Due to the null operation in position 1 within $S_1$, the operator that
executes operations $A_1$ and $A_2$ will not be utilized at time instances $3l + 1$. For a folded
system to be realizable, $D_F(U \to V) \ge 1$ must hold for all arcs. Once valid folding sets
have been assigned, pipelining and retiming can be used to satisfy this property (see
[28]).
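
As an illustration of the folding-set notation, the sketch below represents a folding set as a Python list whose index is the folding order, with `None` standing in for the null operation. The representation and the function names are assumptions chosen for this example.

```python
def folding_orders(folding_set):
    """Map each non-null operation to its folding order (its list position)."""
    return {op: j for j, op in enumerate(folding_set) if op is not None}


def idle_partitions(folding_set):
    """Time partitions in which the processor executes a null operation."""
    return [j for j, op in enumerate(folding_set) if op is None]


def is_realizable(folded_delays):
    """Realizability condition from the text: every folded arc delay is >= 1."""
    return all(delay >= 1 for delay in folded_delays)


# The folding set S1 = {A1, null, A2} for N = 3 used in the text.
S1 = ["A1", None, "A2"]
```

For `S1`, the processor is idle in time partition 1, which corresponds to the unused time instances $3l + 1$ mentioned above.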

In  the  folded  realization,  the  data  on  the  system  input  is  assumed  to  be  valid  for  N 
clock cycles before changing. For example, if $N = 2$ and the folded realization is assumed
to operate with period $T$, then the input sample $x[0]$ must be valid from 0 to $2T$, $x[1]$
must be valid from $2T$ to $4T$, etc.

We demonstrate the use of systematic folding techniques by folding the biquad filter
in Figure 3.5(a). Assume addition and multiplication require 1 and 2 units of time,
respectively (i.e., $T_A = 1$ and $T_M = 2$), and one-stage pipelined adders and two-stage
pipelined multipliers are available (i.e., $P_A = 1$ and $P_M = 2$). A retimed version of this
filter with valid folding sets assigned using folding factor $N = 4$ is shown in Figure 3.5(b).
Folding factor $N = 4$ means that the iteration period of the folded hardware is 4 time
units, i.e., each node of the biquad filter is executed exactly once every 4 time units in
the folded DFG. The folded circuit is shown in Figure 3.6. To see how the folded DFG in
Figure 3.6 is obtained from the algorithm DFG in Figure 3.5(b), consider arc $A_1 \to M_4$.
Using (3.1), we find

$$D_F(A_1 \to M_4) = 4(2) - 0 + 1 - 3 = 6.$$

This means there is an arc in the folded DFG from the adder to the multiplier with 6
delays. Since this arc ends at node $M_4$, which has folding order 1 in the algorithm DFG,
the folded arc is switched at the input of the multiplier in the folded DFG at $4l + 1$.
This folded arc is shaded in Figure 3.6. Using Figure 3.1(a) as another example and
assigning folding orders 2 and 8 to the multiply and add operations, respectively, and
using $N = 10$ and $P_M = 2$, we get $10(0) - 1 + 8 - 2 = 5$ delays in the folded arc as shown
in Figure 3.1(b).
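
The two worked values above can be reproduced with a direct transcription of (3.1). The sketch below is illustrative; the function name and argument order are assumptions, and the internal delay count is taken as one less than the number of pipelining stages, as in the pipelined processor model of Figure 3.3(d).

```python
def folded_delay(N, i, u, v, stages_u):
    """Folded arc delay from equation (3.1).

    N        -- folding factor
    i        -- number of delays on arc U -> V in the algorithm DFG
    u, v     -- folding orders of the source and sink operations
    stages_u -- pipelining stages of the source processor (P_U)
    """
    internal = stages_u - 1          # P'_U, delays internal to the source processor
    return N * i - internal + v - u


# Arc A1 -> M4 of the biquad filter: N = 4, 2 arc delays, u = 3, v = 1, one-stage adder.
biquad_delay = folded_delay(N=4, i=2, u=3, v=1, stages_u=1)

# Multiply -> add arc of Figure 3.1(a): N = 10, 0 arc delays, u = 2, v = 8, two-stage multiplier.
fir_delay = folded_delay(N=10, i=0, u=2, v=8, stages_u=2)
```

The result does not depend on the iteration index, which is why the function takes no such argument.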


Figure 3.5: (a) The biquad filter. (b) The retimed filter with valid folding sets assigned.


Figure 3.6: The folded biquad filter using the specifications given in Figure 3.5(b). The
shaded arc represents arc $A_1 \to M_4$ in the folded DFG.

The folded DFG in Figure 3.6 represents the data path specifications obtained from
the scheduled algorithm DFG by using (3.1); however, this DFG does not represent the
most efficient implementation of the scheduled DFG in terms of memory usage. Throughout
the remaining sections of this chapter, expressions are derived for determining the
most efficient implementation of a statically scheduled DFG in terms of the amount of
memory required for the implementation. We now introduce some definitions that will
be used in these derivations.

Let $x_l$, $l \ge 0$, be the result of the $l$-th iteration of operation $U$. Recall that each node
in the DFG is executed exactly once per iteration. Throughout this chapter, we consider
only nonnegative iterations of each operation, which results in no loss of generality.
Variable $x_l$ is produced exactly once by $H_U$, but may be consumed multiple times by
one or more processors due to the possibility of fanout. We define a unique production
time and a unique consumption time for each variable.

Definition 3.1 The production time of variable $x_l$, denoted as $p_{x_l}$, is the time unit in
which $x_l$ is output from $H_U^{(P'_U)}$, which is $Nl + u + P'_U$. The consumption time of $x_l$,
denoted as $c_{x_l}$, is the latest time unit during which $x_l$ is input to any processor.

Recall that $u$ is the folding order of operation $U$, which is the time partition, or time unit
modulo $N$, in which the operation $U$ is scheduled to be executed by processor $H_U$. Since
we consider only nonnegative iterations of nodes, $p_{x_l} \ge u + P'_U$ always holds. Also, the
consumption time must be greater than the production time, i.e., $p_{x_l} < c_{x_l}$ must always
hold because $D_F(U \to V) \ge 1$ is assumed. In the remainder of the chapter, $p_{x_l} \ge u + P'_U$
and $p_{x_l} < c_{x_l}$ are implicitly assumed. We use $p_{x_l}$ and $c_{x_l}$ to define the time interval for
which the variable $x_l$ is live.

Definition 3.2 The variable $x_l$ is live for all time units in the interval $(p_{x_l}, c_{x_l}]$.
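
Definitions 3.1 and 3.2 translate into a few lines of Python. The sketch below is an illustration under one stated assumption: for a node with fanout, the consumption time is taken as the production time plus the largest folded delay among its output arcs, which is one reading of "the latest time unit during which the variable is input to any processor". Function names are hypothetical.

```python
def production_time(N, l, u, stages_u):
    """Definition 3.1: time unit at which iteration l of U leaves the processor."""
    return N * l + u + (stages_u - 1)


def consumption_time(N, l, u, stages_u, folded_delays):
    """Latest time unit at which the variable is read by any processor.

    Assumption: this equals the production time plus the largest folded
    delay over the output arcs of the node.
    """
    return production_time(N, l, u, stages_u) + max(folded_delays)


def is_live(K, produced, consumed):
    """Definition 3.2: live on the half-open interval (produced, consumed]."""
    return produced < K <= consumed


# Iteration 0 of a node with folding order 3 on a one-stage adder (N = 4),
# using the folded delays 1, 2, 3, 4, 6 quoted later for node A1.
p0 = production_time(4, 0, 3, 1)
c0 = consumption_time(4, 0, 3, 1, [1, 2, 3, 4, 6])
```

The interval is open on the left because a variable produced in time unit `p0` only needs storage from the following time unit onward.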

3.3  Memory  Minimization  without  Retiming 

In  this  section,  we  derive  expressions  for  the  minimum  number  of  registers  required  to 
implement a DFG assuming that the DFG has already been scheduled and no more
circuit transformations (e.g., retiming) are to be performed on the DFG. The minimum
number of registers required to store the variables that are output from a single node
is first computed. The operation-constrained, processor-constrained, and unconstrained
memory models are then described, and expressions are derived for the minimum number
of  registers  required  to  implement  arbitrary  DFGs  under  these  models. 

3.3.1  Minimum  Number  of  Registers  for  Outputs  from  a  Single  Node 

Before  considering  the  case  where  the  output  variables  of  a  node  are  broadcast  to  several 
arcs (e.g., node $U$ in Figure 3.2), we consider the simple case of a single arc $U \to V$ as
shown in Figure 3.4(a). The minimum number of registers required to implement the
$D_F(U \to V)$ delays in Figure 3.4(b) can be calculated using life-time analysis. If we let
$x_l$, $l \ge 0$, be the result of the $l$-th iteration of node $U$, then the production time of $x_l$
is $p_{x_l} = u + P'_U + Nl$ and its consumption time is $c_{x_l} = p_{x_l} + D_F(U \to V)$. Consider
time unit $K$. The first variable that is produced by node $U$ is the result of the 0-th
iteration of $U$, and the production time of this variable is defined to be $p_{x_0}$. A new
variable is produced by node $U$ every $N$ time units, so the number of variables which
have production times prior to time unit $K$ (i.e., which satisfy $p_{x_l} < K$) is

$$r_{p,U}(K) = \left\lceil \frac{K - p_{x_0}}{N} \right\rceil, \tag{3.2}$$


where $\lceil x \rceil$ is the ceiling of $x$, which denotes the smallest integer greater than or equal
to $x$. Using a similar argument, the number of these variables with consumption times
prior to time unit $K$ (i.e., which satisfy $c_{x_l} < K$) is

$$r_{c,U}(K) = \left\lceil \frac{K - c_{x_0}}{N} \right\rceil. \tag{3.3}$$


Note that these expressions for $r_{p,U}(K)$ and $r_{c,U}(K)$ are valid for all $K$ such that
$r_{p,U}(K) \ge 0$ and $r_{c,U}(K) \ge 0$. According to Definition 3.2, a variable is live at time
unit $K$ if it is produced prior to $K$ and not consumed prior to $K$. Therefore, the number
of live variables at time unit $K$ is the difference between the number of variables
produced prior to time unit $K$ and the number of variables consumed prior to time unit
$K$, i.e., $r_{live,U}(K) = r_{p,U}(K) - r_{c,U}(K)$. Using (3.2) and (3.3), the expression for the
number of live variables at time unit $K$ becomes

$$r_{live,U}(K) = \left\lceil \frac{K - p_{x_0}}{N} \right\rceil - \left\lceil \frac{K - c_{x_0}}{N} \right\rceil. \tag{3.4}$$
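
The closed form in (3.4) can be compared against a direct count of the variables whose liveness interval contains $K$. The following Python sketch is a sanity check under illustrative values ($N = 4$, first production time 3, first consumption time 9); the helper names and the bounded iteration count in the brute-force version are assumptions.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for b > 0, avoiding floating point."""
    return -((-a) // b)


def live_closed_form(K, N, p0, c0):
    """Equation (3.4): produced-before-K count minus consumed-before-K count."""
    return ceil_div(K - p0, N) - ceil_div(K - c0, N)


def live_by_counting(K, N, p0, c0, iterations=1000):
    """Count iterations l >= 0 whose interval (p0 + N*l, c0 + N*l] contains K."""
    return sum(1 for l in range(iterations) if p0 + N * l < K <= c0 + N * l)


# Compare both versions once the pipeline has filled (K at or beyond c0).
agreement = all(
    live_closed_form(K, 4, 3, 9) == live_by_counting(K, 4, 3, 9)
    for K in range(9, 40)
)
```

The comparison starts at $K = c_{x_0}$ because (3.2) and (3.3) are only stated to be valid where both counts are nonnegative.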


The minimum number of registers required to implement the $D_F(U \to V)$ delays in
Figure 3.4(b) is the maximum value of $r_{live,U}(K)$ over all $K$. The value of $r_{live,U}(K)$
is periodic in $K$ with period $N$ because the folded architecture operates periodically
with period $N$. Therefore, we only need to evaluate (3.4) for $N$ consecutive time units.
Evaluating (3.4) at time units $K = qN + n$ for some integer $q$ and $n \in [0, N)$ results in
the number of live samples at time partition $n$, given by

$$r_{live,U}(n) = \left\lceil \frac{qN + n - p_{x_0}}{N} \right\rceil - \left\lceil \frac{qN + n - c_{x_0}}{N} \right\rceil
= \left\lceil \frac{n - p_{x_0}}{N} \right\rceil - \left\lceil \frac{n - (p_{x_0} + D_F(U \to V))}{N} \right\rceil,$$

where $c_{x_0} = p_{x_0} + D_F(U \to V)$ has been used. The minimum number of registers
required to implement the $D_F(U \to V)$ delays in Figure 3.4(b) is the maximum value of
$r_{live,U}(n)$ over the interval $n \in [0, N)$, i.e.,

$$\max_{n \in [0, N)} \left\{ r_{live,U}(n) \right\}.$$
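
Since only $N$ time partitions need to be examined, the maximum can be found by plain enumeration. The sketch below does exactly that in Python; the function names are illustrative assumptions, and the sample arguments reuse values that appear in the biquad discussion (folding factor 4, production time 3, folded delays 6 and 1).

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for b > 0."""
    return -((-a) // b)


def live_in_partition(n, N, p0, delay):
    """Live variables of a single arc in time partition n (0 <= n < N)."""
    return ceil_div(n - p0, N) - ceil_div(n - (p0 + delay), N)


def min_registers_single_arc(N, p0, delay):
    """Largest live count over one period, found by checking every partition."""
    return max(live_in_partition(n, N, p0, delay) for n in range(N))


per_partition = [live_in_partition(n, 4, 3, 6) for n in range(4)]
```

For these sample values the per-partition counts are 2, 2, 1, 1, so two registers are enough to implement six folded delays when the period is four.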

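The following lemma can be used to find the maximum of $r_{live,U}(n)$ for $n \in [0, N)$.

Lemma 3.1 Given integers $A$, $B$, $n$, and $N > 0$,

$$\max_{n \in [0, N)} \left\{ \left\lceil \frac{B + n}{N} \right\rceil - \left\lceil \frac{B - A + n}{N} \right\rceil \right\} = \left\lceil \frac{A}{N} \right\rceil.$$

Proof: Since

$$\left\lceil \frac{B + n}{N} \right\rceil - \left\lceil \frac{B - A + n}{N} \right\rceil \tag{3.5}$$

is periodic in $n$ with period $N$, we only need to show that the maximum of this expression
is $\lceil A/N \rceil$ for any $N$ consecutive integers. Therefore, it is sufficient to show that

$$\max_{n \in [A - B,\, A - B + N)} \left\{ \left\lceil \frac{B + n}{N} \right\rceil - \left\lceil \frac{B - A + n}{N} \right\rceil \right\} = \left\lceil \frac{A}{N} \right\rceil.$$

The expression in (3.5) equals $\lceil A/N \rceil$ for $n = A - B$. It remains to show that

$$\left\lceil \frac{B + n}{N} \right\rceil - \left\lceil \frac{B - A + n}{N} \right\rceil \le \left\lceil \frac{A}{N} \right\rceil$$

holds for $n = A - B + 1, A - B + 2, \ldots, A - B + N - 1$. This can be written as

$$\left\lceil \frac{A + k}{N} \right\rceil - \left\lceil \frac{k}{N} \right\rceil \le \left\lceil \frac{A}{N} \right\rceil \tag{3.6}$$

Figure 3.7: (a) A fanout node $U$. (b) The lifetime chart of samples in the folded
architecture.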
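
The identity stated in Lemma 3.1 can be spot-checked numerically over a small range of integers. The Python sketch below is a brute-force check, not a proof; the range limit and function names are arbitrary choices made for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for b > 0."""
    return -((-a) // b)


def lemma_lhs(A, B, N):
    """Left-hand side of Lemma 3.1: maximum over n in [0, N)."""
    return max(ceil_div(B + n, N) - ceil_div(B - A + n, N) for n in range(N))


def lemma_holds(limit=12):
    """Check the lemma for all N in [1, limit) and A, B in [-limit, limit)."""
    return all(
        lemma_lhs(A, B, N) == ceil_div(A, N)
        for N in range(1, limit)
        for A in range(-limit, limit)
        for B in range(-limit, limit)
    )
```

In the register-counting application, $A$ plays the role of the folded delay and $B$ the negated production time, so the lemma gives $\lceil D_F / N \rceil$ registers for a single arc.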


Table 3.1: Summary of the three memory models described in Section 3.3.2.

memory model             outputs of the nodes executed     outputs of different
                         by the same processor can         processors can
                         share registers                   share registers

operation-constrained    No                                No
processor-constrained    Yes                               No
unconstrained            Yes                               Yes

in  G.  This  results  in  no  loss  of  generality  since  arcs  that  communicate  with  the  outside 
world can be included by introducing dummy nodes. Let $\mathcal{U}$ be the set of nodes in G with
at least one output arc that terminates at a node in G. In this section, the expressions
derived in Section 3.3.1 are used to compute the minimum number of registers required
to implement G for the operation-constrained, processor-constrained, and unconstrained
memory models. Table 3.1 gives an overview of the three memory models discussed in
this section.

The Operation-Constrained Memory Model

In the operation-constrained memory model, each node $U \in \mathcal{U}$ in G is allocated a unique
set of registers in the synthesized hardware. The only variables which are allowed to
occupy the registers allocated to $U$ are those variables which result from the execution
of node $U$. As a result, register minimization under the operation-constrained memory
model consists of independently computing the minimum number of registers required
to implement each node $U \in \mathcal{U}$ and adding these results for all nodes in $\mathcal{U}$. Using (3.10)
to compute the number of registers required to implement each node, we get

$$R_O = \sum_{U \in \mathcal{U}} \left\lceil \frac{D_{F,U}^{(max)}}{N} \right\rceil,$$

where $D_{F,U}^{(max)}$ is computed as in (3.8).


Example 3.2 Consider the scheduled biquad filter in Figure 3.5(b). Recall the assumptions
that addition and multiplication require 1 and 2 units of time, respectively (i.e.,
$T_A = 1$ and $T_M = 2$), and one-stage pipelined adders and two-stage pipelined multipliers
are available (i.e., $P_A = 1$ and $P_M = 2$). Table 3.2 shows the number of registers required
to individually implement each node. For example, the five arcs which are output from
node $A_1$ have 1, 2, 3, 4, and 6 folded arc delays. Since $\max\{1, 2, 3, 4, 6\} = 6$, node $A_1$
requires $\lceil 6/4 \rceil = 2$ registers. By adding the values in Table 3.2, we find $R_O = 8$, i.e.,
8 registers are required to implement the biquad filter shown in Figure 3.5(b) using the
operation-constrained memory model. □
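
The arithmetic of Example 3.2 can be retraced in a few lines of Python. This is an illustrative sketch: the helper names are assumptions, only the folded delays quoted for node $A_1$ are recomputed, and the remaining per-node counts are copied from Table 3.2 (with node labels as read from that table) rather than derived.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for b > 0."""
    return -((-a) // b)


def registers_for_node(folded_delays, N):
    """Registers for one node: ceiling of its largest folded delay over N."""
    return ceil_div(max(folded_delays), N)


N = 4

# Node A1: the five output arcs quoted in Example 3.2.
a1_registers = registers_for_node([1, 2, 3, 4, 6], N)

# Per-node register counts as listed in Table 3.2.
table_3_2 = {"A1": 2, "A3": 1, "A4": 1, "M1": 1, "M2": 1, "M3": 1, "M4": 1}

# Operation-constrained total: no sharing, so the per-node counts simply add.
R_O = sum(table_3_2.values())
```

Because no registers are shared between nodes under this model, the total is a plain sum and each node can be sized independently.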


Table 3.2: The number of registers required to implement the nodes of the biquad filter
individually.

Node $U$    $\lceil D_{F,U}^{(max)} / N \rceil$
$A_1$       2
$A_3$       1
$A_4$       1
$M_1$       1
$M_2$       1
$M_3$       1
$M_4$       1

The operation-constrained memory model is suboptimal with respect to minimization
of registers since the registers are often underutilized. For example, consider nodes
$A_3$ and $A_4$ in Figure 3.5(b). These two nodes belong to folding set $S_1$ so they are
executed by the same processor, which is a one-stage pipelined adder. The outputs of
this adder due to $A_3$ and $A_4$ must be delayed by 1 time unit since using (3.1) we find that
$D_F(A_3 \to A_1) = 1$ and $D_F(A_4 \to A_2) = 1$ in Figure 3.5(b). Since the variables resulting
from operation $A_3$ are live during time units $4l + 3$ and the variables resulting from $A_4$
are live during time units $4l + 1$, these outputs could share the same register; however,
under the operation-constrained memory model, each of the nodes $A_3$ and $A_4$ requires
a separate register. This particular underutilization problem could be eliminated by
allowing all variables which are output from the same processor to share registers, which
leads to the processor-constrained memory model.

The  Processor-Constrained  Memory  Model 

In  the  processor-constrained  memory  model,  each  processor  in  the  synthesized  hardware 
is  allocated  a  unique  set  of  registers.  The  only  variables  which  are  allowed  to  occupy  the 
registers  allocated  to  a  processor  are  those  variables  which  are  output  from  that  particular 
processor.  As  a  result,  register  minimization  under  the  processor-constrained  memory 
model  consists  of  individually  computing  the  minimum  number  of  registers  required 
to allocate the outputs of each processor and adding these results for all processors.
Recall that the nodes (i.e., operations) which are executed by the same processor belong
to the same folding set. The processor-constrained memory model is less restrictive
than the operation-constrained memory model since the processor-constrained model
allows outputs from the nodes in a folding set to share registers in the synthesized
hardware, while the operation-constrained memory model allows no memory sharing
among variables produced by different nodes. To determine the number of registers
required to implement all nodes in a folding set, we must compute the number of live
variables due to the nodes in the folding set for each time partition $n \in [0, N)$.


For each node $U \in \mathcal{U}$, we must first compute $D_{F,U}^{(max)}$ using (3.8). The number of live
variables due to node $U$ in time partition $n$ can be found by substituting $p_{x_0} = u + P'_U$
into (3.9) to get

$$r_{live,U}(n) = \left\lceil \frac{n - (u + P'_U)}{N} \right\rceil - \left\lceil \frac{n - (u + P'_U + D_{F,U}^{(max)})}{N} \right\rceil. \tag{3.11}$$

Let $S_1, S_2, \ldots, S_s$ be the folding sets in G. Note that $s$ is the number of folding sets in
G, which is equivalent to the number of processors in the folded realization of G. The
number of live variables in time partition $n \in [0, N)$ due to all $U \in S_k$ is

$$r_{live,S_k}(n) = \sum_{U \in S_k} r_{live,U}(n),$$

and the number of registers required to implement all nodes $U \in S_k$ is

$$r_{live,S_k}^{(max)} = \max_{n \in [0, N)} \left\{ r_{live,S_k}(n) \right\}.$$

The minimum number of registers required to implement G using the processor-constrained
memory model is

$$R_P = \sum_{k=1}^{s} r_{live,S_k}^{(max)} = \sum_{k=1}^{s} \max_{n \in [0, N)} \left\{ \sum_{U \in S_k} r_{live,U}(n) \right\}.$$

Table 3.3: The number of live variables at the output of each operator of the folded
biquad filter for all possible time partitions.

time    $S_1$ (+)    $S_2$ (×)
0       2            2
1       3            1
2       1            2
3       2            1

Example 3.3 For the biquad filter in Figure 3.5(b), the number of registers required to
delay the outputs of the adder is $r_{live,S_1}^{(max)} = 3$ and the number of registers required to delay
the outputs of the multiplier is $r_{live,S_2}^{(max)} = 2$. As a result, $R_P = 5$, i.e., 5 registers are
required to implement the folded biquad filter using the processor-constrained memory
model.
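
Example 3.3 follows directly from the per-partition live counts in Table 3.3: take the peak for each folding set, then add the peaks. The Python sketch below illustrates this; the dictionary layout and the function name are assumptions.

```python
# Live variables per time partition 0..3, copied from Table 3.3.
LIVE_COUNTS = {
    "S1": [2, 3, 1, 2],   # adder
    "S2": [2, 1, 2, 1],   # multiplier
}


def processor_constrained_registers(live_counts):
    """Sum, over folding sets, of each set's peak live count in one period."""
    return sum(max(counts) for counts in live_counts.values())


R_P = processor_constrained_registers(LIVE_COUNTS)
```

Each processor is sized for its own worst-case partition, even when those worst cases fall in different time partitions.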

The  processor-constrained  memory  model  may  not  result  in  the  minimum  number  of 
registers  because  variables  which  are  output  from  different  processors  are  not  allowed  to 
share registers. Table 3.3 shows the number of live variables for the scheduled biquad
filter in Figure 3.5(b) for the folding sets $S_1$ (adder) and $S_2$ (multiplier) during each
time partition. The total number of live variables during any time partition can be
found by simply adding the number of live variables due to $S_1$ and $S_2$ for that time
partition. Notice that the maximum number of live variables in any time partition is 4
even  though  we  computed  in  Example  3.3  that  the  folded  implementation  requires  5 
registers  using  the  processor-constrained  memory  model.  This  demonstrates  that  the 
processor-constrained  memory  model  may  not  achieve  global  optimality  with  respect  to 
register  minimization;  however,  this  may  still  result  in  an  efficient  architecture  due  to 
local  interconnection. 
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The  Unconstrained  Memory  Model 


In  the  unconstrained  memory  model,  each  variable  can  be  stored  in  any  register  in 
the  synthesized  hardware,  regardless  of  the  node  in  the  DFG  or  the  processor  in  the 
synthesized  hardware  from  which  the  variable  originates.  The  minimum  number  of 
registers  required  under  the  unconstrained  memory  model  is  computed  by  taking  the 
maximum  of  the  total  number  of  live  variables  in  G  over  one  period  of  operation,  which 
can  be  written  as 

$$R_U = \max_{n \in [0,N)} \left\{ \sum_{U \in \mathcal{U}} n_{live,U}(n) \right\}, \tag{3.12}$$

where (3.8) and (3.11) are used to compute $n_{live,U}(n)$. The quantity $R_U$ represents the
theoretical lower bound on the number of registers required to implement G.

Example 3.4 Table 3.4 lists the value of $n_{live,U}(n)$ for all nodes $U \in \mathcal{U}$ and all time
partitions $n \in [0,N)$ for the biquad filter in Figure 3.5(b). The number of live variables
for each time partition can be found by taking the sum of each column, i.e., these
values for time partitions 0, 1, 2, and 3 are 4, 4, 3, and 3, respectively. The minimum
number of registers required using the unconstrained memory model is $R_U = 4$
since max{4, 4, 3, 3} = 4. Recall that, for this example, the operation-constrained memory
model required 8 registers and the processor-constrained memory model required 5
registers. □
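
The three register counts quoted in Example 3.4 can be reproduced from the live-variable counts alone. The following is a minimal sketch, not the tool used in this work: the dictionary keys are hypothetical placeholders for the rows of Table 3.4 (in row order), and the operation-constrained count is assumed to be the sum of the per-node maxima, which agrees with the value of 8 quoted above.

```python
# Live-variable counts n_live,U(n) copied from Table 3.4 (N = 4 time partitions).
# The keys are hypothetical placeholders for the table rows, in row order.
N = 4
live = {
    "adder_row1": [2, 2, 1, 1],
    "adder_row2": [0, 0, 0, 1],
    "adder_row3": [0, 1, 0, 0],
    "mult_row1": [0, 0, 1, 0],
    "mult_row2": [1, 0, 0, 0],
    "mult_row3": [0, 1, 1, 0],
    "mult_row4": [1, 0, 0, 1],
}

# Folding sets: S1 is the hardware adder, S2 is the hardware multiplier.
folding_sets = {
    "S1": ["adder_row1", "adder_row2", "adder_row3"],
    "S2": ["mult_row1", "mult_row2", "mult_row3", "mult_row4"],
}


def operation_constrained(live):
    """Assumed form: each node keeps private registers, so sum the per-node maxima."""
    return sum(max(counts) for counts in live.values())


def processor_constrained(live, folding_sets, N):
    """Sum over folding sets of the maximum, over n, of the live variables in that set."""
    total = 0
    for members in folding_sets.values():
        per_partition = [sum(live[u][n] for u in members) for n in range(N)]
        total += max(per_partition)
    return total


def unconstrained(live, N):
    """Maximum over n of the total number of live variables, as in (3.12)."""
    return max(sum(counts[n] for counts in live.values()) for n in range(N))


R_O = operation_constrained(live)
R_P = processor_constrained(live, folding_sets, N)
R_U = unconstrained(live, N)
print(R_O, R_P, R_U)  # 8 5 4
```

The per-set totals computed inside `processor_constrained` are the columns of Table 3.3, so the same data explains why the processor-constrained model needs one register more than the unconstrained model.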

To determine the computational complexity of computing $R_U$ in (3.12), let m be the
number of nodes in G. Clearly, the number of nodes $U \in \mathcal{U}$ cannot be greater than m. If
we assume the maximum number of inputs to any node is a constant that is independent
of m, then the number of arcs in G grows linearly with m, and the maximum in (3.8) can be
computed for all $U \in \mathcal{U}$ in $O(m)$ time. The maximum number of nodes in G that can be
executed by a single processor is m (the uniprocessor case), so $N \le m$ holds. Then
$n_{live,U}(n)$ in (3.11) can be computed for all $U \in \mathcal{U}$ and $n \in [0,N)$ in $O(m^2)$ time. The
summation in (3.12) represents $O(m^2)$ additions, and finding the maximum in (3.12)
requires $O(m)$ comparisons. Therefore, $R_U$ can be computed for an arbitrary DFG with
m nodes in $O(m^2)$ time.

Table 3.4: The number of live variables due to each node in the biquad filter for all
possible time partitions.

                              n = 0   n = 1   n = 2   n = 3
n_live,A?(n)                  2       2       1       1
n_live,A?(n)                  0       0       0       1
n_live,A?(n)                  0       1       0       0
n_live,M1(n)                  0       0       1       0
n_live,M2(n)                  1       0       0       0
n_live,M3(n)                  0       1       1       0
n_live,M4(n)                  1       0       0       1
sum over U of n_live,U(n)     4       4       3       3

3.3.3  Comparison  of  Memory  Models 

Table 3.5 compares the number of registers required for several benchmark filters under
the various memory models. The benchmarks used are the fourth-order all-pole lattice
filter mentioned in [59] (F1), the fifth-order wave digital elliptic filter introduced
in [47] (F2), the fourth-order Jaumann wave digital filter mentioned in [60] (F3), the
four-stage pipelined lattice filter [61] (F4), and the biquad filter shown in Figure 3.5(a)
(F5). These filters were scheduled using the MARS system [26]. Notice from Table 3.5
that $R_U < R_P < R_O$ for all of these filters, which agrees with intuition since the
operation-constrained memory model has the most restrictions on memory sharing while
the unconstrained memory model has no restrictions on memory sharing.

Table 3.5: Register count using various memory models. The benchmark filters used are
fourth-order lattice filter (F1), fifth-order wave digital elliptic filter (F2), fourth-order
Jaumann filter (F3), four-stage pipelined lattice filter (F4), and biquad filter shown in
Figure 3.5(a) (F5). N is the iteration period.

Filter   N     R_O   R_P   R_U
F1       ?     ?     ?     ?
F2       16    34    12    10
F3       ?     ?     ?     ?
F4       2     29    20    18
F5       4     8     5     4

It is important to note that the three memory models considered in Section 3.3.2 are
representative of the various models which can be chosen. New memory models can be
defined as needed, and expressions can be derived for the minimum number of registers
for these models using the same approach as used in Section 3.3.2.

While  Table  3.5  gives  the  number  of  required  registers  using  the  three  memory  models 
described  in  Section  3.3.2,  there  are  side-effects  which  are  not  shown  in  the  table.  For 
example,  decreasing  the  number  of  registers  by  using  the  unconstrained  model  typically 
increases  the  number  of  multiplexers  required  to  allocate  data  to  these  registers,  and 
the  overall  effect  of  using  fewer  registers  may  actually  be  an  increase  in  area  due  to  the 
area  of  the  multiplexers.  As  a  result,  the  number  of  registers  cannot  be  considered  to 
be  the  sole  cost  of  the  circuit,  and  several  memory  models  may  need  to  be  evaluated  to 
determine  the  best  one  for  a  given  application. 

3.4  Memory  Minimization  Using  Retiming 


The derivations in Section 3.3 are based on the assumption that the DFG has been
scheduled and no more circuit transformations are to be performed on the DFG. In
this section, we consider optimal retiming of the DFG after scheduling so the resulting
implementation uses the minimum number of registers under the unconstrained memory
model.

Retiming is often used to reduce the critical path or minimize the number of delays
in a circuit [27]. Retiming has also been used for scheduling [11], [12], [26]. This section
deals with using retiming to minimize the number of registers in the hardware realization
of a statically scheduled DFG. Of course, the retiming must always maintain the validity
of the schedule by keeping $D_F(U \to V) \ge 1$ for all arcs $U \to V$ so the resulting DFG is
realizable.

The problem of minimizing the number of delays in a scheduled DFG is not equivalent
to minimizing the number of registers required by the hardware realization of the DFG.
For example, the DFG in Figure 3.8(a) contains 3 delays and its hardware realization
requires 5 registers using the unconstrained memory model when we assume an iteration
period of N = 2 and that all hardware processors are pipelined by P = 1 stage. The
folding orders are indicated next to the nodes. A retimed version of the DFG is shown in
Figure 3.8(b), where the retiming values r(1) = 0, r(2) = 0, and r(3) = 1 are used. This
retimed DFG contains 4 delays and its hardware realization requires 4 registers using
the unconstrained memory model. From this example, we see that use of retiming to
decrease the number of delays in the DFG can actually increase the number of registers
required to implement the DFG in hardware.

Figure 3.8: (a) A scheduled DFG which has 3 delays and whose hardware requires 5
registers. (b) A retimed version of the DFG which has 4 delays and whose hardware
requires 4 registers. For both parts, an iteration period of 2 is assumed and all nodes
are mapped to processors with one pipelining stage.

Recall that arc $U \to V$ in Figure 3.4 is folded using (3.1). Using retiming, the number
of delays in arc $U \to V$ can be changed from $i$ to

$$i_r = i + r(V) - r(U), \tag{3.13}$$

where $i_r$ is the number of delays in arc $U \to V$ in the retimed algorithm DFG, and $r(X)$
denotes the retiming value of node X [27]. Let $D'_F(U \to V)$ denote the number of folded
arc delays obtained by folding arc $U \to V$ in the retimed algorithm DFG. To ensure that
the corresponding arc in the folded hardware DFG has a nonnegative number of delays,
we must force the constraint $D'_F(U \to V) \ge 1$, which is equivalent to

$$N i_r - P_U + v - u - 1 \ge 0. \tag{3.14}$$

This constraint ensures that the schedule which was determined prior to retiming is also
valid after retiming. Since the retiming values for the nodes are restricted to be integers,
(3.13) and (3.14) can be combined as in [28] to obtain

$$r(U) - r(V) \le \left\lfloor \frac{D_F(U \to V) - 1}{N} \right\rfloor, \tag{3.15}$$

where $\lfloor x \rfloor$ is the floor of x, which denotes the largest integer less than or equal to x. Once
the set of constraints for the DFG is found using (3.15) (there is one such constraint
for each arc in the algorithm DFG), a solution must be found using an appropriate
technique. We consider an ILP formulation that satisfies the constraints while minimizing
the number of registers required to implement the folded hardware DFG.
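
The retiming constraint (3.15) is straightforward to check for a candidate set of retiming values. The sketch below is illustrative only: it assumes that the folding equation (3.1) has the form $D_F(U \to V) = Ni - P_U + v - u$, which is the form implied by (3.14), and the arc tuple layout and function names are hypothetical.

```python
def folded_delay(N, i, P_U, u, v):
    """Assumed form of the folding equation (3.1): D_F = N*i - P_U + v - u."""
    return N * i - P_U + v - u


def retiming_bound(N, d_f):
    """Right-hand side of (3.15): floor((D_F - 1) / N)."""
    return (d_f - 1) // N  # Python's // rounds toward minus infinity, i.e. a true floor


def retiming_is_valid(N, arcs, r):
    """Check (3.15) on every arc.

    arcs: list of tuples (U, V, i, P_U, u, v) describing arc U -> V.
    r:    dict mapping each node to its integer retiming value.
    """
    for U, V, i, P_U, u, v in arcs:
        d_f = folded_delay(N, i, P_U, u, v)
        if r[U] - r[V] > retiming_bound(N, d_f):
            return False
    return True


# One arc U -> V with i = 1 delay, P_U = 1, folding orders u = 0 and v = 1, and N = 2.
N = 2
arcs = [("U", "V", 1, 1, 0, 1)]
print(folded_delay(N, 1, 1, 0, 1))                    # 2
print(retiming_is_valid(N, arcs, {"U": 0, "V": 0}))   # True
print(retiming_is_valid(N, arcs, {"U": 1, "V": 0}))   # False
```

In the last call the retimed arc would have $D_F + N(r(V) - r(U)) = 2 - 2 = 0$ folded delays, which violates the requirement of at least one delay, so the check rejects it.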


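In addition to the constraints specified by (3.15), the ILP technique must also use
constraints to find the maximum values in (3.8) and (3.12). We refer to this formulation
as Minimum Physical Storage Location (MPSL) retiming, which is summarized below.
The equations in Step (II) of MPSL retiming are similar to those used in [21].

MPSL retiming: Minimize $R_U$ subject to

(I) $\forall U \in \mathcal{U}$ and $\forall V \in \mathcal{V}_U$:

$$r(U) - r(V) \le \left\lfloor \frac{D_F(U \to V) - 1}{N} \right\rfloor$$

(II) $\forall U \in \mathcal{U}$ and $\forall V \in \mathcal{V}_U$:

$$D'^{(\max)}_F(U) \ge D_F(U \to V) + N\,(r(V) - r(U))$$

(III) $\forall n \in [0,N)$:

$$R_U \ge \sum_{U \in \mathcal{U}} n'_{live,U}(n)$$

Here $D'^{(\max)}_F(U)$ denotes the maximum folded delay over the output arcs of U after
retiming, and $n'_{live,U}(n)$ is the expression (3.11) for the number of live variables evaluated
with $D'^{(\max)}_F(U)$, i.e., the difference of two rounded quotients with numerators
$n - (u + P_U)$ and $n - (u + P_U) - D'^{(\max)}_F(U)$ and common denominator N.
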

Consider  the  biquad  filter  shown  in  Figure  3.5(a).  Assume  Ta  =  1,  Tm  =  2,  Pa  =  1, 
and  Pm  =  2.  The  iteration  bound,  i.e.,  the  lower  bound  on  the  achievable  iteration 
period,  is  4  units  [60],  [62],  and  we  consider  scheduling  the  DFG  so  that  the  iteration 
period  is  equal  to  the  iteration  bound.  Using  the  schedule  found  by  the  MARS  system, 
the  MPSL  formulation  retimes  the  DFG  such  that  the  minimum  number  of  registers 
required  to  implement  the  biquad  filter  using  the  unconstrained  memory  model  is  4. 
One  such  retiming  of  the  schedule  is  shown  in  Figure  3.5(b)  (recall  that  Ru  =  4  was 
computed  for  Figure  3.5(b)  in  Example  3.4).  Figure  3.9  shows  the  complete  synthesized 
hardware  for  the  DFG  in  Figure  3.5(b).  Notice  that  register  Ri  is  not  utilized  in  time 
partition  2  and  Ri  is  not  utilized  in  time  partition  3.  This  underutilization  can  also  be 
seen in Table 3.4, where the sums of the n = 2 and n = 3 columns are each equal to 3,
so that only 3 of the four registers are utilized during time partitions 2 and 3. In spite
of  this  underutilization,  the  DFG  in  Figure  3.5(b)  uses  the  minimum  possible  number  of 
registers  for  the  given  schedule. 

Figure 3.9: The complete synthesized hardware for the scheduled biquad filter in
Figure 3.5(b). D and FU represent word-size registers.

The MPSL retiming problem was solved using the ILP solver GAMS [63]. We note
that in some cases, GAMS found an integer solution which it could not prove was optimal.
In  these  cases,  we  proved  that  the  solution  was  optimal  by  showing  that  there  is  a 
time  partition  for  which  no  better  solution  exists.  When  applying  MPSL  retiming  to 
the  schedules  obtained  by  MARS,  we  found  that  MPSL  retiming  did  not  reduce  the 
number  of  required  registers  compared  to  the  retiming  performed  by  MARS,  i.e.,  for  the 
five  benchmark  filters  we  considered,  the  MARS  system  optimally  retimed  the  filters  in 
terms  of  the  number  of  registers  required  under  the  unconstrained  memory  model  for 
the  schedules  generated.  Although  this  result  suggests  that  the  retiming  performed  by 
MARS  is  good,  it  says  nothing  about  the  quality  of  the  schedules  obtained  by  MARS 
with  respect  to  memory  requirements. 

To determine how the scheduling technique used by the MARS design system performs
in terms of minimizing the required number of registers, the MARS schedules were
compared to globally optimal results. To determine optimal results in terms of the
number of registers, an ILP model is used which schedules a DFG by first minimizing
the number of processors and then minimizing the number of registers, as in [37]. The
results are shown in Table 3.6, where parameters Ta = 1, Pa = 1, Tm = 2, and
Pm = 2 are assumed. First, the table shows that MPSL retiming does not change
the number of registers required by the MARS schedules. The table also shows that
the schedules obtained from the MARS system are optimal or near-optimal in terms of
register requirements for the five benchmark filters.

Table 3.6: Register count for the benchmark filters described in Table 3.5. N is the
iteration period.

Filter   N     MARS schedule,    MARS schedule,    ILP schedule
               MARS retiming     MPSL retiming
F1       10    6                 6                 5
F2       16    10                10                10
F3       10    7                 7                 6
F4       2     18                18                18
F5       4     4                 4                 4

Example 3.5 Figure 3.10(a) shows a retimed version of the fifth-order wave digital
elliptic filter given in [47]. The filter has been retimed using the MPSL retiming according
to the schedule in Table 3.7 generated using the MARS system. Figure 3.10(b) shows
the synthesized architecture which uses 10 registers. The 10 registers are denoted as $R_i$,
and the internal pipeline delay of the multiplier, which cannot be shared by other data
paths, is denoted as D. Note that parameters Ta = 1, Pa = 1, Tm = 2, and Pm = 2
are assumed, and the iteration period of the hardware is 16 units, which is the iteration
bound for the parameters assumed.

Table 3.7: The schedule from the MARS system for the fifth-order wave digital elliptic
filter.

node                  1         2         3         4         5         6         7
folding (set|order)   (S1|14)   (S1|0)    (S1|11)   (S1|15)   (S4|12)   (S1|10)   (S2|1)

node                  8         9         10        11        12        13        14
folding (set|order)   (S1|7)    (S?|?)    (S4|8)    (S2|12)   (S2|15)   (S2|0)    (S4|13)

node                  15        16        17        18        19        20        21
folding (set|order)   (S1|6)    (S1|2)    (S1|3)    (S2|7)    (S2|8)    (S3|7)    (S4|4)

node                  22        23        24        25        26        27        28
folding (set|order)   (S4|5)    (S3|8)    (S3|2)    (S3|11)   (S3|12)   (S4|9)    (S3|13)

node                  29        30        31        32        33        34
folding (set|order)   (S3|0)    (S3|1)    (S4|14)   (S1|12)   (S1|1)    (S4|15)


3.5  Conclusions 


Efficient  use  of  memory  in  application-specific  architectures  for  DSP  is  very  important 
in  order  to  meet  design  specifications.  Inefficient  use  of  memory  can  result  in  inefficient 
designs  due  to  effects  such  as  increased  area  and  increased  power  consumption. 

We have derived closed-form expressions for the minimum number of registers required
by a statically scheduled DSP program for the operation-constrained, processor-constrained,
and unconstrained memory models. We first derived expressions for the
minimum number of registers under the operation-constrained and processor-constrained
models, and we demonstrated via the biquad filter example why these memory models
are not optimal in terms of the number of registers required. We then derived the
expression for the minimum number of registers under the unconstrained memory model. This
expression, which gives the theoretical lower bound on the number of registers required
to implement a statically scheduled DSP program, can be computed in $O(m^2)$ time for
a DFG with m nodes. The techniques we used in our derivations can also be used to
determine expressions for lower bounds on memory requirements for other memory models
not discussed in the chapter. The results in this chapter are most applicable to dedicated
application-specific hardware; however, we believe that these results can also be applied
to other technologies, such as FPGA-based designs.

We also considered retiming to minimize memory requirements of a statically scheduled
DFG. The MPSL retiming formulation uses integer linear programming techniques
to determine the optimal retiming of the DFG in terms of memory required under the
unconstrained memory model while maintaining the validity of the schedule. We used
MPSL retiming to verify that retiming performed by the MARS system is optimal for
the benchmark filters we considered. We then compared memory requirements of schedules
obtained by MARS to schedules obtained using integer linear programming which
are optimal in terms of required memory under the unconstrained memory model. Our
results show that the schedules obtained by MARS are optimal or close to optimal in
terms of memory requirements.

The evaluation of the schedules obtained by MARS demonstrates how the techniques
presented in this chapter can be used for evaluation of high-level synthesis systems. These
techniques can be used for design and evaluation throughout the high-level synthesis
process.

Figure 3.10: (a) Fifth-order wave digital elliptic filter. The DFG has been retimed
using MPSL retiming to minimize the number of registers required given the schedule
generated by the MARS system (see Table 3.7). (b) Synthesized hardware using the
minimum possible iteration period of 16 and the theoretical lower limit of 10 registers.

Chapter 4

Multirate  Folding 

4.1  Introduction 

The  widespread  use  of  digital  representation  of  signals  for  transmission  and  storage  has 
created  challenges  in  the  area  of  digital  signal  processing  (DSP).  In  response  to  these 
challenges,  new  DSP  algorithms  have  emerged  for  tasks  such  as  compression  and  filtering 
of  digital  signals.  Many  of  these  algorithms  are  multirate  in  nature,  meaning  that  the 
sample  rate  is  not  constant  throughout  the  algorithm  description  [5].  While  the  theory  of 
multirate  DSP  has  matured  over  the  past  decade,  there  has  been  relatively  little  research 
on  the  topic  of  designing  efficient  real-time  architectures  for  multirate  systems.  This  has 
resulted  in  a  lack  of  CAD  tools  that  can  translate  multirate  algorithms  into  efficient 
VLSI  architectures. 

Considerable work has been done in the area of scheduling multirate DSP algorithms
and constructing efficient DSP code for these algorithms [55, 64, 57, 65, 66]. The topic
of this chapter is multirate folding [36], which is a technique for systematically synthesizing
control circuits for single-rate architectures which implement multirate algorithms.
Throughout this chapter, the term single-rate architecture is used to describe a synchronous
architecture where the entire architecture operates with the same clock period.
Examples of data-flow graphs (DFGs) describing multirate DSP algorithms are shown in
Figure 4.1. The DFGs in Figure 4.1 are multirate due to decimation by 2 (the ↓2 block, which
discards every other sample) and expansion by 2 (the ↑2 block, which inserts a zero between
each adjacent pair of samples), which respectively halve and double the sample rate of a
signal. A direct mapping of a multirate DSP algorithm to hardware would require data
to move at different rates on the chip. This would require routing and synchronization of
multiple clock signals on the chip. To avoid these problems, we concentrate on mapping
the multirate DSP programs to single-rate VLSI architectures.

The  advantages  of  multirate  folding  fall  into  two  broad  categories.  The  first  advantage 
is  that  the  multirate  folding  equations  can  be  used  to  systematically  determine  the 
control  circuitry  for  the  architecture  from  a  scheduled  DFG.  The  second  advantage, 
which  is  slightly  more  subtle,  is  that  this  formal  approach  can  be  used  to  address  other 
related  problems  in  high-level  synthesis  in  a  formal  manner.  Two  such  problems,  memory 
minimization  and  retiming  [27],  are  considered  in  this  chapter.  Using  the  multirate 
folding  equations,  we  derive  expressions  for  the  minimum  number  of  registers  required 
to  implement  the  architectures,  and  we  derive  constraints  for  retiming  the  circuit  such 
that  a  given  schedule  is  valid. 

We first introduced multirate folding in [36] as a technique for synthesizing architectures
for tree-structured filter banks. Full and pruned tree-structured filter banks
are useful for many DSP applications, such as signal coding and analysis. Recent interest
in the discrete wavelet transform (DWT) has significantly increased the number
of applications for tree-structured filter banks since the DWT can be computed using
a pruned tree-structured filter bank [42, 41, 43, 44]. Computation of wavelet packet
bases is another application of pruned tree-structured filter banks [45]. Full binary
tree-structured filter banks for signal analysis and synthesis are shown in parts (a) and (b)
of Figure 4.1. Pruned binary tree-structured filter banks which represent analysis and
synthesis structures for the discrete wavelet transform (DWT) are shown in parts (c) and
(d) of Figure 4.1. Multirate folding can be used to synthesize architectures for each of the
four filter banks in Figure 4.1. In Section 4.6, we give a detailed example which shows
how the techniques presented in this chapter can be used to design an architecture for
the three-level discrete wavelet transform analysis filter bank as shown in Figure 4.1(c).

Figure 4.1: Examples of full and pruned binary tree-structured filter banks. (a) Full-tree
analysis filter bank. (b) Full-tree synthesis filter bank. (c) Pruned-tree analysis filter
bank which can be used to compute the DWT. (d) Pruned-tree synthesis filter bank
which can be used to compute the inverse DWT.


The  main  properties  of  multirate  folding  are  summarized  below: 

•  Multirate  folding  is  a  novel  technique  for  synthesizing  control  circuits  for  single-rate 
architectures  which  implement  multirate  DSP  algorithms. 

•  The multirate folding equations allow us to address other problems in high-level
synthesis, such as memory minimization and retiming.


•  Multirate  folding  operates  directly  on  the  multirate  DFG,  avoiding  the  step  of  first 
constructing  an  equivalent  single-rate  algorithm  description. 

•  Multirate  folding  accounts  for  pipelining,  so  architectures  can  be  designed  for  high 
speed  and  low  power  [67]  applications. 

•  Multirate folding is applicable to a wide variety of DSP algorithms. We demonstrate
its utility by designing a discrete wavelet transform architecture.

The chapter is organized as follows. Section 4.2 reviews some fundamentals of multirate
digital signal processing. In Section 4.3, we derive the folding equations which
are  used  to  systematically  synthesize  the  control  circuits  for  the  pipelined  architectures. 
Retiming  for  multirate  folding  is  addressed  in  Section  4.4.  Memory  requirements  for 
the  folded  architectures  are  addressed  in  Section  4.5,  and  the  discrete  wavelet  transform 
design  example  is  given  in  Section  4.6.  Our  conclusions  are  stated  in  Section  4.7. 

4.2  Some  Multirate  DSP  Fundamentals 

This  section  provides  a  review  of  some  multirate  DSP  fundamentals  which  are  used 
throughout  the  chapter. 

Multirate DSP algorithm descriptions contain decimators and/or expanders. Figure 4.2
shows a decimator and an expander, which obey the input-output relationships
$y_D(n) = x(Mn)$ and

$$y_E(n) = \begin{cases} x(n/M) & \text{if } n \text{ is a multiple of } M \\ 0 & \text{otherwise.} \end{cases}$$

Note that we use the term expander rather than interpolator to describe the block in
Figure 4.2(b) since interpolation generally implies expansion followed by filtering. The
decimator and expander both have the effect of changing the sample rate.

Figure 4.2: (a) Decimation by M. (b) Expansion by M.
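
The two input-output relationships can be illustrated on a finite block of samples. This is a minimal sketch for intuition only; the function names are hypothetical, and a finite Python list stands in for the signal.

```python
def decimate(x, M):
    """Decimation by M: y_D(n) = x(M n), i.e. keep every M-th sample."""
    return x[::M]


def expand(x, M):
    """Expansion by M: y_E(n) = x(n / M) if n is a multiple of M, and 0 otherwise."""
    y = [0] * (len(x) * M)
    y[::M] = x  # place the input samples at indices 0, M, 2M, ...
    return y


x = [1, 2, 3, 4, 5, 6]
print(decimate(x, 2))      # [1, 3, 5]
print(expand([1, 2, 3], 2))  # [1, 0, 2, 0, 3, 0]
```

Decimating an expanded signal by the same factor recovers the original samples, whereas expanding a decimated signal does not, since the discarded samples are replaced by zeros.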

The  noble  identities  are  useful  for  theory  and  implementation  of  multirate  DSP  [5]. 
Special  cases  of  these  identities  are  shown  in  Figure  4.3.  These  relationships  are  used  in 
Section  4.4  to  derive  conditions  for  retiming  a  multirate  DFG  for  folding. 


Figure 4.3: Redistribution of delays in a multirate system using the noble identities.

4.3  Derivation  of  Folding  Equations 

Folding  is  a  technique  for  systematically  determining  control  circuits  in  architectures 
where  multiple  algorithm  operations  (such  as  addition  operations)  are  time-multiplexed 
to  a  single  hardware  module  (such  as  a  pipelined  ripple-carry  adder)  [28].  The  folding 
transformation  is  similar  to  loop  folding  [68]  which  has  been  used  in  high-level  synthesis. 
Figure  4.4  shows  an  example  of  how  folding  can  be  used  to  time-multiplex  two  algorithm 
operations  to  a  single  hardware  operator.  Folding  equations  have  been  derived  in  the  past 
for  folding  single-rate  algorithms  to  single-rate  architectures,  and  for  folding  single-rate 
algorithms to multirate architectures [28]. In this section, we review folding of single-rate
algorithms to single-rate architectures, and then derive equations for folding multirate
algorithms to single-rate architectures.

Figure 4.4: (a) A simple single-rate DSP algorithm with two addition operations. (b)
A folded architecture where the two addition operations are folded to a single hardware
adder with one stage of pipelining.

4.3.1  Single-Rate  Folding 

Consider an arc (also referred to as an edge) connecting nodes U and V with i delays,
as in Figure 4.5(a). Let the l-th iteration of nodes U and V be scheduled to execute
at time units $N_U l + u$ and $N_V l + v$, respectively, where u and v are the folding orders
of nodes U and V which satisfy $u \in [0, N_U)$ and $v \in [0, N_V)$. The hardware operators
(also referred to as functional units) which execute nodes U and V are denoted as $H_U$
and $H_V$, respectively. Note that $N_U$ and $N_V$ operations are folded to $H_U$ and
$H_V$, respectively. If $H_U$ is pipelined by $P_U$ stages, then the result of the l-th iteration of
node U is available at $N_U l + u + P_U$. Since arc $U \to V$ has i delays, the result of node
U is used by the (l + i)-th iteration of V, which is executed at $N_V(l + i) + v$. Therefore,
the result must be stored for

$$D_F(U \to V) = N_V(l + i) + v - (N_U l + P_U + u) = (N_V - N_U)\,l + N_V i - P_U + v - u$$

time units. Since we assume that DSP programs iterate from l = 0 to l = ∞, practical
concerns require $N_U = N_V$ to avoid the cases where $D_F(U \to V)$ approaches $+\infty$ or
$-\infty$ as l gets large. With $N = N_U = N_V$, the folding equation becomes

$$D_F(U \to V) = N i - P_U + v - u, \tag{4.1}$$

which is independent of the iteration number, l. Arc $U \to V$ maps to a path from $H_U$
to $H_V$ in the architecture with $D_F(U \to V)$ delays, and data on this path are input to
$H_V$ at $Nl + v$, as illustrated in Figure 4.5(b).

Figure 4.5: (a) An arc $U \to V$ with i delays. (b) The corresponding folded arc.
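
The derivation of (4.1) can be checked numerically. The sketch below, with hypothetical function names and example values, computes the storage time directly from the schedule and compares it with the closed-form folding equation.

```python
def storage_time(N_U, N_V, l, i, P_U, u, v):
    """Time units that the result of iteration l of U waits before V consumes it."""
    available = N_U * l + u + P_U   # output of the pipelined operator H_U
    consumed = N_V * (l + i) + v    # iteration (l + i) of V
    return consumed - available


def folded_delay(N, i, P_U, u, v):
    """Single-rate folding equation (4.1)."""
    return N * i - P_U + v - u


# Example values (chosen for illustration): N = 4, i = 1, P_U = 1, u = 1, v = 3.
print(folded_delay(4, 1, 1, 1, 3))  # 5

# With N_U = N_V the storage time does not depend on the iteration number l.
print([storage_time(4, 4, l, 1, 1, 1, 3) for l in range(4)])  # [5, 5, 5, 5]

# With N_U != N_V the storage time grows without bound as l increases.
print([storage_time(4, 5, l, 1, 1, 1, 3) for l in range(4)])  # [6, 7, 8, 9]
```

The second list shows why the requirement $N_U = N_V$ is needed: otherwise the number of delays on the folded arc would change from one iteration to the next.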


4.3.2  Multirate  Folding 

Multirate folding provides a systematic technique for mapping multirate algorithms to
single-rate hardware. Folding equations are first derived for arcs which contain decimators
and then for arcs which contain expanders.

The  Folding  Equation  for  Arcs  Containing  Decimators 

Consider  the  arc  C/  ->  V  in  Figure  4.6(a),  where  the  output  of  node  U  passes  through  ii 
delays,  decimation  by  M,  and  12  delays  before  reaching  node  V.  Let  the  l-th.  iteration 
of  node  U  execute  at  time  unit  Nul  +  u  and  the  f-th  iteration  of  V  execute  at  Nyl  -I-  v, 
where  the  folding  orders  satisfy  u  €  [0,  ATf/)  and  v  e  [0,  iVv^). 

The  signals  labeled  in  Figure  4.6(a)  are  related  by 

wi{l)  = 

W2{1)  =  wi{Ml)  =  x{Ml  -  h) 
y{l)  =  W2{l-i2)-x{M{l-i2)-i\) 
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(a) 


(b) 


Figure  4.6:  (a)  An  arc  U  ^  V  which  contains  a  decimator.  (b)  The  corresponding  folded 
arc. 

which  implies  that  the  sample  y{l),  which  is  consumed  during  the  l-th  iteration  of  V,  is 
produced  during  the  {Ml  —  (Mi2  +ii))-th  iteration  of  U.  Sample  y{l)  is  consumed  by  Hy 
in  time  unit  Nyl  +  v  and  is  produced  by  Hu  in  time  unit  Nu{Ml  -  {Mi2+ii))  +  u.  If  Hu 
is  pipelined  by  Pu  stages,  theny(/)  is  available  at  time  unit  Nu{Ml-{Mi2+ii))+u+Pu. 
Therefore,  y{l)  must  be  stored  for 

Dp{U  — ^  V^)  =  Nyl  V  —  {Nu{Ml  —  {Mi2  +  u))  +  u  +  Pu) 

=  {Ny  —  MNu)l  +  Nu{Mi2  +  *i)  ~  Pu  +  V  —  u 

time units. As in the single-rate case, we would like this expression to be independent of $l$. This can be achieved by forcing $N_V = MN_U$, which implies that node $U$ executes $M$ times for each execution of node $V$. This is intuitive since the output of node $U$ is decimated by $M$ before reaching node $V$. With $N_V = MN_U$, the folding equation becomes

$$D_F(U \to V) = N_U(Mi_2 + i_1) - P_U + v - u, \tag{4.2}$$

which is independent of the iteration number, $l$.

Since node $V$ is scheduled to execute on hardware operator $H_V$ at time units $N_V l + v$, the data on arc $U \to V$ are input to $H_V$ at time units $N_V l + v$, as illustrated in Figure 4.6(b). For the case of $M = 1$, i.e., where the decimator does not affect the data stream, $i_1$ and $i_2$ can be combined as $i = i_1 + i_2$, and $N_U = N_V = N$, where $N$ is the iteration period of nodes $U$ and $V$. Substituting these expressions into (4.2) gives the single-rate folding equation (4.1).
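
As a quick numerical check, the decimator folding equation (4.2) can be evaluated directly. The sketch below is illustrative only: the function and argument names are not from this thesis, and the schedule is assumed to already satisfy $N_V = MN_U$. The sample values are those of the arc $U \to V_0$ in Example 4.3.

```python
def folded_delay_decimator(n_u, m, i1, i2, p_u, u, v):
    """Folded-arc delays for an arc containing a decimator, per (4.2).

    n_u    : iteration period N_U of node U (N_V = M * N_U is assumed)
    m      : decimation factor M
    i1, i2 : delays before and after the decimator
    p_u    : pipeline stages P_U of the processor that executes U
    u, v   : folding orders of nodes U and V
    """
    return n_u * (m * i2 + i1) - p_u + v - u


def folded_delay_single_rate(n, i, p_u, u, v):
    """Single-rate folding equation (4.1): N*i - P_U + v - u."""
    return n * i - p_u + v - u


# N_U = 2, M = 3, i1 = 2, i2 = 1, P_U = 1, u = 1, v = 1
print(folded_delay_decimator(2, 3, 2, 1, 1, 1, 1))  # 2*(3*1 + 2) - 1 + 1 - 1 = 9
```

Setting `m = 1` makes the first function agree with the second for `i = i1 + i2`, which mirrors the reduction to (4.1) described above.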

The  Folding  Equation  for  Arcs  Containing  Expanders 

Consider the arc $U \to V$ in Figure 4.7(a), where the output of node $U$ passes through $i_1$ delays, expansion by $L$, and $i_2$ delays before reaching node $V$. Let the $l$-th iteration of node $U$ execute at time unit $N_U l + u$ and the $l$-th iteration of $V$ execute at $N_V l + v$, where the folding orders satisfy $u \in [0, N_U)$ and $v \in [0, N_V)$.

Figure 4.7: (a) An arc $U \to V$ which contains an expander. (b) The corresponding folded arc.

The signals labeled in Figure 4.7(a) are related by

$$w_2(l) = y(l + i_2)$$
$$w_1(l) = w_2(Ll) = y(Ll + i_2)$$
$$x(l) = w_1(l + i_1) = y(L(l + i_1) + i_2),$$

which implies that sample $x(l)$, which is the output of the $l$-th iteration of $U$, is used as the input of the $(L(l + i_1) + i_2)$-th iteration of $V$. Sample $x(l)$ is available at the output of processor $H_U$ at time unit $N_U l + u + P_U$ and is consumed by $H_V$ at time unit $N_V(L(l + i_1) + i_2) + v$, so $x(l)$ must be stored for

$$\begin{aligned}
D_F(U \to V) &= N_V(L(l + i_1) + i_2) + v - (N_U l + u + P_U) \\
&= (N_V L - N_U)\,l + N_V(Li_1 + i_2) - P_U + v - u
\end{aligned}$$

time units.

For this expression to be independent of $l$, $N_V L = N_U$ must hold. This implies that node $V$ executes $L$ times for every execution of node $U$, which makes sense since the output of node $U$ is expanded by $L$ before reaching node $V$. With $N_V L = N_U$, the folding equation becomes

$$D_F(U \to V) = N_V(Li_1 + i_2) - P_U + v - u, \tag{4.3}$$

which is independent of the iteration number, $l$. The samples on the folded arc are input to $H_V$ at $N_U l + u + P_U + D_F(U \to V) = N_U l + N_V(Li_1 + i_2) + v$, so the folded arc is switched at the input of $H_V$ at $N_U l + N_V(Li_1 + i_2) + v$, as illustrated in Figure 4.7(b).

For the case of $L = 1$, i.e., where the expander does not affect the data stream, $i_1$ and $i_2$ can be combined as $i = i_1 + i_2$, and $N_U = N_V = N$, where $N$ is the iteration period of nodes $U$ and $V$. Substituting these expressions into (4.3) gives the single-rate folding equation (4.1).
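
The expander folding equation (4.3) and the switching instants of the folded arc can be checked numerically in the same way. This is a minimal sketch under the assumption $N_U = LN_V$; the helper names are hypothetical and the sample values are those of the arc $U \to V_1$ in Example 4.2.

```python
def folded_delay_expander(n_v, l, i1, i2, p_u, u, v):
    """Folded-arc delays for an arc containing an expander, per (4.3).

    n_v    : iteration period N_V of node V (N_U = L * N_V is assumed)
    l      : expansion factor L
    i1, i2 : delays before and after the expander
    p_u    : pipeline stages P_U of the processor that executes U
    u, v   : folding orders of nodes U and V
    """
    return n_v * (l * i1 + i2) - p_u + v - u


def switch_time(n_v, l, i1, i2, v, iteration):
    """Time unit at which the folded arc is switched at the input of H_V."""
    n_u = l * n_v
    return n_u * iteration + n_v * (l * i1 + i2) + v


# N_V = 2, L = 3, i1 = 2, i2 = 0, P_U = 1, u = 2, v = 0
d_f = folded_delay_expander(2, 3, 2, 0, 1, 2, 0)
print(d_f)  # 2*(3*2 + 0) - 1 + 0 - 2 = 9

# Production time of iteration 0 plus the folded delay equals the switch time.
n_u, u, p_u = 6, 2, 1
print(n_u * 0 + u + p_u + d_f == switch_time(2, 3, 2, 0, 0, 0))  # True
```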

4.4  Retiming  for  Folding 

Retiming  for  folding  is  the  process  of  retiming  a  DFG  so  the  number  of  delays  on  any 
folded  arc  is  nonnegative.  The  constraints  which  guarantee  this  for  single-rate  folding 
have  been  derived  in  [28].  In  this  section,  we  review  the  single-rate  constraint  and  derive 
the  retiming  constraints  which  ensure  that  the  number  of  folded  arc  delays  is  nonnegative 
for  multirate  folding. 
4.4.1 Single-Rate Case


The  constraint  which  guarantees  that  the  number  of  folded  arc  delays  is  nonnegative  for 
single-rate  arcs  was  derived  in  [28]  to  be 


$$r(U) - r(V) \le \left\lfloor \frac{D_F(U \to V)}{N} \right\rfloor. \tag{4.4}$$


This  equation  is  a  special  case  of  the  constraints  which  are  derived  in  the  next  subsection 
for  arcs  with  decimators  or  expanders. 


4.4.2  Multirate  Cases 


For (4.2) to be useful, $D_F(U \to V) \ge 0$ must hold given a feasible schedule. The data-flow graph can be retimed to satisfy this condition. Let $i_1'$ and $i_2'$ be the number of delays on arc $U \to V$ after retiming. Using (4.2), the number of delays on the folded arc after retiming is

$$D_F'(U \to V) = N_U(Mi_2' + i_1') - P_U + v - u.$$

The values of $i_1'$ and $i_2'$ are related to $i_1$ and $i_2$ by

$$i_1' = i_1 + Mr(D_{UV}) - r(U)$$

and

$$i_2' = i_2 + r(V) - r(D_{UV}),$$

where $r(U)$ and $r(V)$ are the retiming values of nodes $U$ and $V$, respectively, i.e., the number of times one delay is removed from each of the output arcs of the node and one delay is added to each of the input arcs of the node. According to the multirate DSP fundamentals reviewed in Section 4.2, the retiming value of the decimator, $r(D_{UV})$, is the number of times one delay is removed from its output and $M$ delays are added to its input. Substituting the expressions for $i_1'$ and $i_2'$, we find

$$\begin{aligned}
D_F'(U \to V) &= N_U\left[M(i_2 + r(V) - r(D_{UV})) + i_1 + Mr(D_{UV}) - r(U)\right] - P_U + v - u \\
&= D_F(U \to V) + N_U(Mr(V) - r(U)),
\end{aligned}$$

which is independent of $r(D_{UV})$. We can retime the data-flow graph for folding by forcing $D_F'(U \to V) \ge 0$, which gives

$$r(U) - Mr(V) \le \left\lfloor \frac{D_F(U \to V)}{N_U} \right\rfloor. \tag{4.5}$$


Similarly, we can use retiming to guarantee $D_F(U \to V) \ge 0$, where $D_F(U \to V)$ is computed as in (4.3). If $i_1'$ and $i_2'$ are the number of delays on the arc after retiming, then

$$D_F'(U \to V) = N_V(Li_1' + i_2') - P_U + v - u.$$

The expressions for $i_1'$ and $i_2'$ are

$$i_1' = i_1 + r(E_{UV}) - r(U)$$

and

$$i_2' = i_2 + r(V) - Lr(E_{UV}),$$

where $r(E_{UV})$ is the retiming value of the expander, which is the number of times we remove $L$ delays from its output and add one delay to its input. Substituting, we get

$$\begin{aligned}
D_F'(U \to V) &= N_V\left[L(i_1 + r(E_{UV}) - r(U)) + i_2 + r(V) - Lr(E_{UV})\right] - P_U + v - u \\
&= D_F(U \to V) + N_V(r(V) - Lr(U))
\end{aligned}$$

as the number of folded arc delays after retiming. Forcing $D_F'(U \to V) \ge 0$ gives

$$Lr(U) - r(V) \le \left\lfloor \frac{D_F(U \to V)}{N_V} \right\rfloor.$$
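
The retimed folded delays and the two multirate retiming-for-folding constraints derived above can be sketched as follows. The function names and the sample arc are hypothetical. The bound is floored because the left-hand side of each inequality is an integer, and Python's `//` operator floors toward negative infinity, which is what the constraint needs when $D_F(U \to V)$ is negative.

```python
def retimed_delay_decimator(d_f, n_u, m, r_u, r_v):
    """Folded delays after retiming an arc with a decimator."""
    return d_f + n_u * (m * r_v - r_u)


def retimed_delay_expander(d_f, n_v, l, r_u, r_v):
    """Folded delays after retiming an arc with an expander."""
    return d_f + n_v * (r_v - l * r_u)


def satisfies_decimator_constraint(d_f, n_u, m, r_u, r_v):
    """r(U) - M r(V) <= floor(D_F / N_U)."""
    return r_u - m * r_v <= d_f // n_u


def satisfies_expander_constraint(d_f, n_v, l, r_u, r_v):
    """L r(U) - r(V) <= floor(D_F / N_V)."""
    return l * r_u - r_v <= d_f // n_v


# Hypothetical decimator arc whose folded delay is negative before retiming.
d_f, n_u, m = -3, 2, 2
r_u, r_v = 0, 1
print(satisfies_decimator_constraint(d_f, n_u, m, r_u, r_v))  # True
print(retimed_delay_decimator(d_f, n_u, m, r_u, r_v))  # -3 + 2*(2*1 - 0) = 1
```

Each constraint holds exactly when the corresponding retimed delay is nonnegative, so either form can be used when searching for retiming values.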
Caution must be exercised when retiming a multirate DFG due to its periodically time-varying nature. For example, consider the multirate DFG in Figure 4.8(a). If we retime this DFG by assigning the adder a retiming value of $-1$ and assigning the multiplier a retiming value of 0, we get the DFG in Figure 4.8(b). The problem is that these two circuits have completely different functionality. In the single-rate case, retiming an input node simply results in a delay of the output signal, whereas this example shows that retiming an input node of a multirate DFG can completely change the functionality of the circuit. This issue is taken into consideration in the design example in Section 4.6.


Figure 4.8: (a) A multirate DFG which computes $z_1(n) = a(x(2n) + y(2n))$. (b) Retimed version which computes $z_2(n) = a(x(2n - 1) + y(2n - 1))$.
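
The difference between the two circuits in Figure 4.8 can be made concrete with a small numerical experiment. The input signals and the coefficient below are arbitrary choices made for illustration and are not taken from the thesis. Because $z_1$ reads only even-indexed input samples and $z_2$ reads only odd-indexed ones, an input that is nonzero only at even indices yields a nonzero $z_1$ and an all-zero $z_2$, so $z_2$ cannot be a delayed copy of $z_1$.

```python
def x(k):
    """Hypothetical input: 1 at even, nonnegative indices and 0 elsewhere."""
    return 1 if k >= 0 and k % 2 == 0 else 0


def y(k):
    """Hypothetical second input: identically zero."""
    return 0


A = 1  # multiplier coefficient a (arbitrary choice)


def z1(n):
    """Output of the DFG in Figure 4.8(a)."""
    return A * (x(2 * n) + y(2 * n))


def z2(n):
    """Output of the retimed DFG in Figure 4.8(b)."""
    return A * (x(2 * n - 1) + y(2 * n - 1))


z1_out = [z1(n) for n in range(6)]
z2_out = [z2(n) for n in range(6)]
print(z1_out)  # [1, 1, 1, 1, 1, 1]
print(z2_out)  # [0, 0, 0, 0, 0, 0]
```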


4.5  Memory  Requirements  for  Folded  DSP  Architectures 

In this section, we derive expressions for the minimum number of registers required by a folded architecture. The expressions are based on the assumption that a node $U$ in a DFG is one of the following types:

• Type S: Each outgoing edge of node $U$ contains no decimators and no expanders.

• Type D: Each outgoing edge of node $U$ contains one decimator ($\downarrow M$) and no expanders.

• Type E: Each outgoing edge of node $U$ contains no decimators and one expander ($\uparrow L$).

We begin by computing the number of registers required to store the output signal of a Type S node. We then compute the number of registers required to store the output signals of Type E and Type D nodes. Finally, we compute the number of registers required to implement a DSP algorithm which may contain Type S, Type D, and Type E nodes.


4.5.1  Type  S  Nodes 


Consider the simple case of an arc $U \to V$ as shown in Figure 4.5. The minimum number of registers required to implement the folded edge in Figure 4.5(b) can be calculated using life-time analysis. The idea is to compute the number of samples which exit pipelined processor $H_U$ and enter processor $H_V$ prior to time unit $K$. By subtracting the number of samples which enter $H_V$ from the number of samples which exit $H_U$, we find the number of live samples at time unit $K$. The minimum number of registers required to implement the folded edge is the maximum number of live samples over all $K$.


As in Section 4.3, we assume that the $l$-th iterations of nodes $U$ and $V$ are scheduled to execute at time units $N_U l + u$ and $N_V l + v$, respectively. We found in Section 4.3 that for this to be feasible $N_U = N_V$ must hold. If we let $x_l$, $l \ge 0$, be the result of the $l$-th iteration of node $U$, then the production time of $x_l$, which is the time unit that $x_l$ exits pipelined processor $H_U$ in Figure 4.5(b), is $p_{x_l} = N_U l + u + P_U$. The consumption time of $x_l$, which is the time unit that $x_l$ enters processor $H_V$, is $c_{x_l} = p_{x_l} + D_F(U \to V)$. The number of samples which have production times prior to time unit $K$ (i.e., which satisfy $p_{x_l} < K$) is

$$r_{p,U}(K) = \left\lceil \frac{K - p_{x_0}}{N_U} \right\rceil, \tag{4.6}$$

where $\lceil x \rceil$ is the ceiling of $x$, which denotes the smallest integer greater than or equal to $x$. The number of samples with consumption times prior to time unit $K$ (i.e., which satisfy $c_{x_l} < K$) is

$$r_{c,U}(K) = \left\lceil \frac{K - c_{x_0}}{N_U} \right\rceil. \tag{4.7}$$


We define $x_l$ to be live over the interval $(p_{x_l}, c_{x_l}]$. Using this definition, we find that the number of samples that are live at time unit $K$ is given by $r_{live,U}(K) = r_{p,U}(K) - r_{c,U}(K)$, which is

$$r_{live,U}(K) = \left\lceil \frac{K - p_{x_0}}{N_U} \right\rceil - \left\lceil \frac{K - c_{x_0}}{N_U} \right\rceil. \tag{4.8}$$

The minimum number of registers required to implement the $D_F(U \to V)$ delays in Figure 4.5(b) is the maximum value of $r_{live,U}(K)$ over all $K$. The value of $r_{live,U}(K)$ is periodic in $K$ with period $N_U$ because the folded architecture operates periodically with period $N_U$. Therefore, we only need to evaluate (4.8) for $N_U$ consecutive time units. Evaluating (4.8) at time units $K = qN_U + n$ for some integer $q$ and $n \in [0, N_U)$ results in the number of live samples at time partition $n$, given by

$$\begin{aligned}
r_{live,U}(n) &= \left\lceil \frac{qN_U + n - p_{x_0}}{N_U} \right\rceil - \left\lceil \frac{qN_U + n - (p_{x_0} + D_F(U \to V))}{N_U} \right\rceil \\
&= \left\lceil \frac{n - p_{x_0}}{N_U} \right\rceil - \left\lceil \frac{n - (p_{x_0} + D_F(U \to V))}{N_U} \right\rceil.
\end{aligned}$$

The minimum number of registers required to implement the $D_F(U \to V)$ delays is the maximum value of $r_{live,U}(n)$ over the interval $n \in [0, N_U)$, i.e.,

$$r^{(max)}_{live,U} = \max_{n \in [0, N_U)} \left\{ r_{live,U}(n) \right\}.$$

If we let $B = -p_{x_0}$, $A = D_F(U \to V)$, and $N = N_U$, then Lemma 3.1 can be used to show that

$$r^{(max)}_{live,U} = \left\lceil \frac{D_F(U \to V)}{N_U} \right\rceil$$

is the minimum number of registers required to implement the folded edge in Figure 4.5(b).
The more general case, where the output of the node is allowed to be the source of one or more arcs, is now considered. Let $\mathcal{E}_U$ be the set of outgoing edges of node $U$. We assume for this discussion that node $U$ is a Type S node.

If $x_l$ is an output sample of node $U$, then the latest time unit in which $x_l$ is scheduled to be used by a processor is

$$c_{x_l} = p_{x_l} + \max_{e \in \mathcal{E}_U} \left\{ D_F(U \xrightarrow{e} ?) \right\}. \tag{4.9}$$

If we let

$$D^{S(max)}_{F,U} = \max_{e \in \mathcal{E}_U} \left\{ D_F(U \xrightarrow{e} ?) \right\},$$

then (4.9) can be rewritten as

$$c_{x_l} = p_{x_l} + D^{S(max)}_{F,U}.$$

The expressions for $r_{p,U}(K)$ and $r_{c,U}(K)$ for the output signal of node $U$ are the same as in (4.6) and (4.7), and the number of live samples at time unit $K$ is given by (4.8). Substituting $p_{x_0} = u + P_U$, $c_{x_0} = p_{x_0} + D^{S(max)}_{F,U}$, and $K = qN_U + n$ into (4.8) gives the number of live samples at time partition $n \in [0, N_U)$, which is

$$r_{live,U}(n) = \left\lceil \frac{n - (u + P_U)}{N_U} \right\rceil - \left\lceil \frac{n - (u + P_U + D^{S(max)}_{F,U})}{N_U} \right\rceil. \tag{4.10}$$

Lemma 3.1 can be used to show that the maximum of the expression in (4.10) for $n \in [0, N_U)$ is

$$r^{(max)}_{live,U} = \left\lceil \frac{D^{S(max)}_{F,U}}{N_U} \right\rceil,$$

which is the minimum number of registers required to implement the Type S node.

Example 4.1 Consider the Type S node in Figure 4.9(a), where the iteration periods for the nodes are $N_U = N_{V_1} = N_{V_2} = 2$. The folding orders for the nodes are $u = 0$, $v_1 = 0$, and $v_2 = 1$, and we assume that node $U$ is executed by a single-stage pipelined processor, i.e., $P_U = 1$. The folding equations are

$$D_F(U \to V_1) = 2(2) - 1 + 0 - 0 = 3$$
$$D_F(U \to V_2) = 2(1) - 1 + 1 - 0 = 2,$$

so $D^{S(max)}_{F,U} = \max\{3, 2\} = 3$. The minimum number of registers required to implement this Type S node is

$$r^{(max)}_{live,U} = \left\lceil \frac{3}{2} \right\rceil = 2.$$

This can also be seen in the lifetime chart in Figure 4.9(b), where the maximum number of live samples for any time step is 2.

Figure 4.9: (a) A Type S node $U$. (b) The lifetime chart of samples in the folded architecture.
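
The Type S result can be reproduced by evaluating the live-sample expression (4.10) at every time partition and taking the maximum. The sketch below uses hypothetical helper names and an integer-only ceiling to avoid floating-point rounding; the parameter values are those of Example 4.1.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def live_samples(n, p0, lifetime, period):
    """Live samples at time partition n, in the form of (4.10).

    p0       : production time of the first sample (u + P_U)
    lifetime : largest folded delay over the outgoing edges
    period   : iteration period N_U
    """
    return ceil_div(n - p0, period) - ceil_div(n - (p0 + lifetime), period)


def min_registers(p0, lifetime, period):
    """Maximum number of live samples over one period."""
    return max(live_samples(n, p0, lifetime, period) for n in range(period))


# Example 4.1: N_U = 2, u = 0, P_U = 1, folded delays 3 and 2.
n_u, u, p_u = 2, 0, 1
d_max = max(3, 2)
per_partition = [live_samples(n, u + p_u, d_max, n_u) for n in range(n_u)]
print(per_partition)                       # [2, 1]
print(min_registers(u + p_u, d_max, n_u))  # 2
print(ceil_div(d_max, n_u))                # 2, the closed form from Lemma 3.1
```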


4.5.2  Type  E  Nodes 

In this section we show how to compute the minimum number of registers required to store the output signal of a Type E node. We begin by computing the minimum number of registers required to implement the folded edge in Figure 4.7(b). Let $x_l$ be the output of the $l$-th iteration of $U$, which is available at $p_{x_l} = N_U l + u + P_U$. This sample is consumed by $V$ at $c_{x_l} = p_{x_l} + D_F(U \to V)$. At time unit $K$, the number of samples with $p_{x_l} < K$ is

$$r_{p,U}(K) = \left\lceil \frac{K - p_{x_0}}{N_U} \right\rceil.$$

One sample of $x_l$ is produced by node $U$ every $N_U$ time units. Each of these samples is consumed by node $V$, so one sample of $x_l$ must be consumed by node $V$ every $N_U$ time units in order to avoid a build-up or deficiency of samples of $x_l$ on the folded arc. Since node $V$ consumes one sample of $x_l$ every $N_U$ time units, the number of samples with $c_{x_l} < K$ is

$$r_{c,U}(K) = \left\lceil \frac{K - c_{x_0}}{N_U} \right\rceil.$$

Keeping Figure 4.7 in mind, it is interesting to note that while $U$ produces a sample of $x_l$ every $N_U$ time units and $V$ consumes a sample of $x_l$ every $N_U$ time units, node $V$ is actually executed in hardware once every $N_V = N_U/L$ time units. As a result, only $(1/L)$-th of the executions of node $V$ in hardware are used to process the output of node $U$. In a typical multirate system, node $V$ will have $L$ input arcs, each of which occupies $(1/L)$-th of the executions of $V$ in hardware, so all executions of $V$ in hardware are utilized.


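The number of live samples of $x_l$ at time unit $K$ is

$$r_{live,U}(K) = \left\lceil \frac{K - p_{x_0}}{N_U} \right\rceil - \left\lceil \frac{K - c_{x_0}}{N_U} \right\rceil. \tag{4.11}$$

Substituting $K = qN_U + n$ and $c_{x_0} = p_{x_0} + D_F(U \to V)$ gives

$$r_{live,U}(n) = \left\lceil \frac{n - p_{x_0}}{N_U} \right\rceil - \left\lceil \frac{n - (p_{x_0} + D_F(U \to V))}{N_U} \right\rceil,$$

which is the number of live samples of $x_l$ at time partition $n \in [0, N_U)$. Lemma 3.1 can be used to find the minimum number of registers required to implement the folded arc, which is

$$\left\lceil \frac{D_F(U \to V)}{N_U} \right\rceil.$$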
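Computing the memory requirements for a general Type E node, i.e., where the output of node $U$ can be input to several other nodes after expansion by $L$, is quite simple. Let $\mathcal{E}_U$ be the set of outgoing edges of node $U$, and let

$$D^{E(max)}_{F,U} = \max_{e \in \mathcal{E}_U} \left\{ D_F(U \xrightarrow{e} ?) \right\}.$$

The production time of $x_l$ is $p_{x_l} = N_U l + u + P_U$, and the consumption time is $c_{x_l} = p_{x_l} + D^{E(max)}_{F,U}$. The number of live samples at time unit $K$ is given by (4.11), so we can substitute $K = qN_U + n$ along with expressions for $p_{x_0}$ and $c_{x_0}$ to get

$$r_{live,U}(n) = \left\lceil \frac{n - p_{x_0}}{N_U} \right\rceil - \left\lceil \frac{n - (p_{x_0} + D^{E(max)}_{F,U})}{N_U} \right\rceil,$$

and it follows from Lemma 3.1 that

$$\max_{n \in [0, N_U)} \left\{ r_{live,U}(n) \right\} = \left\lceil \frac{D^{E(max)}_{F,U}}{N_U} \right\rceil.$$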

Example 4.2 Consider the Type E node in Figure 4.10(a), where node $U$ has iteration period $N_U = 6$ and nodes $V_1$ and $V_2$ have iteration period $N_{V_1} = N_{V_2} = 2$. The folding orders for the nodes are $u = 2$, $v_1 = 0$, and $v_2 = 1$, and we assume that node $U$ is executed by a single-stage pipelined processor, i.e., $P_U = 1$. The folding equations are

$$D_F(U \to V_1) = 2(3(2) + 0) - 1 + 0 - 2 = 9$$
$$D_F(U \to V_2) = 2(3(2) + 1) - 1 + 1 - 2 = 12,$$

so $D^{E(max)}_{F,U} = \max\{9, 12\} = 12$. The minimum number of registers required to implement this Type E node is

$$\left\lceil \frac{12}{6} \right\rceil = 2.$$

This can also be seen in the lifetime chart in Figure 4.10(b), where the maximum number of live samples for any time step is 2.
Figure 4.10: (a) A Type E node $U$. (b) The lifetime chart of samples in the folded architecture.
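
A lifetime chart such as the one in Figure 4.10(b) can be imitated by brute force: list the production time of each sample, mark it live over the half-open interval (production, consumption], and count the live samples at each time unit of one steady-state period. The sketch below is not the closed-form derivation used in this section. It is a simulation written only to cross-check that derivation, with hypothetical names and a fixed number of simulated samples that is assumed to be large enough for the small examples shown.

```python
def max_live_by_simulation(period, p0, lifetime, num_samples=64):
    """Count live samples by direct enumeration over one steady-state period.

    period      : time units between successive productions (N_U)
    p0          : production time of the first sample (u + P_U)
    lifetime    : time units each sample must be stored
    num_samples : how many samples to simulate (assumed sufficiently large)
    """
    production_times = [period * k + p0 for k in range(num_samples)]
    start = p0 + lifetime + period  # skip the start-up transient
    best = 0
    for t in range(start, start + period):
        # A sample is live at t if it was produced before t and is consumed at or after t.
        live = sum(1 for p in production_times if p < t <= p + lifetime)
        best = max(best, live)
    return best


# Example 4.2: N_U = 6, u = 2, P_U = 1, largest folded delay 12.
print(max_live_by_simulation(6, 2 + 1, 12))  # 2
```

The same function applied to the Type S parameters of Example 4.1 (period 2, first production at time 1, lifetime 3) also returns 2.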

4.5.3  Type  D  Nodes 


In this section we show how to compute the minimum number of registers required to store the output signal of a Type D node. We begin by computing the minimum number of registers required to implement the folded edge in Figure 4.6(b). Let $x_l$, $l \ge 0$, be the result of the $l$-th iteration of $U$. The first step is to partition $x_l$ into $M$ subsequences $x_j^m = x_{Mj+m}$ for $j \ge 0$ and $m \in [0, M)$. We must now determine which of these $M$ subsequences of $x_l$ is consumed by node $V$. To determine this, recall that $y(k) = x(M(k - i_2) - i_1)$ in Figure 4.6(a). This can be rewritten as $y(k) = x(Mk_2 + k_1)$, where

$$k_2 = k - i_2 - \left\lceil \frac{i_1}{M} \right\rceil$$

and

$$k_1 = M\left\lceil \frac{i_1}{M} \right\rceil - i_1.$$

Notice here that $0 \le k_1 \le M - 1$ always holds. Based on this analysis, we can see that node $V$ in Figure 4.6(a) consumes the subsequence $x_j^m = x_{Mj+m}$ for $j \ge 0$ and

$$m = M\left\lceil \frac{i_1}{M} \right\rceil - i_1.$$

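Sample $x_j^m$ is output from pipelined processor $H_U$ at time unit $p_{x_j^m} = N_U(Mj + m) + u + P_U$. This sample is input to processor $H_V$ at time unit $c_{x_j^m} = p_{x_j^m} + D_F(U \to V)$. One can see from these expressions that one sample of $x_j^m$ is produced and consumed every $N_V = MN_U$ time units. At time unit $K$, the number of samples of $x_j^m$ with $p_{x_j^m} < K$ is

$$r_{p,U_m}(K) = \left\lceil \frac{K - p_{x_0^m}}{N_V} \right\rceil$$

and the number with consumption times satisfying $c_{x_j^m} < K$ is

$$r_{c,U_m}(K) = \left\lceil \frac{K - c_{x_0^m}}{N_V} \right\rceil.$$

The number of live samples of $x_j^m$ at time unit $K$ is

$$r_{live,U_m}(K) = \left\lceil \frac{K - p_{x_0^m}}{N_V} \right\rceil - \left\lceil \frac{K - c_{x_0^m}}{N_V} \right\rceil. \tag{4.12}$$

Substituting $K = qN_V + n$ for integer $q$ and $n \in [0, N_V)$ and $c_{x_0^m} = p_{x_0^m} + D_F(U \to V)$ gives

$$r_{live,U_m}(n) = \left\lceil \frac{n - p_{x_0^m}}{N_V} \right\rceil - \left\lceil \frac{n - (p_{x_0^m} + D_F(U \to V))}{N_V} \right\rceil.$$

Using Lemma 3.1, we find that the number of registers required to implement the folded edge in Figure 4.6(b) is

$$\max_{n \in [0, N_V)} \left\{ r_{live,U_m}(n) \right\} = \left\lceil \frac{D_F(U \to V)}{N_V} \right\rceil.$$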

We now consider the memory requirements for a general Type D node, where the output of node $U$ may be the input to several nodes. Let $\mathcal{E}_{U_m}$ denote the set of outgoing edges of node $U$ which are incident into nodes which consume the subsequence $x_j^m$. In other words, each edge $e \in \mathcal{E}_{U_m}$ satisfies $M\lceil i_1/M \rceil - i_1 = m$, where $i_1$ is the number of delays on $e$ between $U$ and the decimator on $e$.

The number of live samples at time unit $K$ for the edges in $\mathcal{E}_{U_m}$ is given by (4.12). The production time of $x_j^m$ is still $p_{x_j^m} = N_U(Mj + m) + u + P_U$. The consumption time is now $c_{x_j^m} = p_{x_j^m} + D^{D(max)}_{F,U_m}$, where

$$D^{D(max)}_{F,U_m} = \max_{e \in \mathcal{E}_{U_m}} \left\{ D_F(U \xrightarrow{e} ?) \right\}. \tag{4.13}$$


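Using these expressions along with $K = qN_V + n$ in (4.12) gives

$$r_{live,U_m}(n) = \left\lceil \frac{n - (N_U m + u + P_U)}{N_V} \right\rceil - \left\lceil \frac{n - (N_U m + u + P_U + D^{D(max)}_{F,U_m})}{N_V} \right\rceil \tag{4.14}$$

as the number of live samples of subsequence $x_j^m$ at time partition $n \in [0, N_V)$. The minimum number of registers required to implement the edges in $\mathcal{E}_{U_m}$ is

$$r^{(max)}_{live,U_m} = \max_{n \in [0, N_V)} \left\{ r_{live,U_m}(n) \right\}.$$

Lemma 3.1 can be used to show that

$$r^{(max)}_{live,U_m} = \left\lceil \frac{D^{D(max)}_{F,U_m}}{N_V} \right\rceil. \tag{4.15}$$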


The amount of memory required to store $x_{Mj+m}$ can be determined using (4.15) for each $m \in [0, M)$. Therefore, one might mistakenly assume that the number of registers required to store all output samples of $U$ is the sum of the minimum number of registers required to store each of the $M$ subsequences $x_j^m$, i.e., an incorrect expression for the minimum number of registers required to store the output samples of node $U$ is

$$\sum_{m=0}^{M-1} \left\lceil \frac{D^{D(max)}_{F,U_m}}{N_V} \right\rceil. \tag{4.16}$$
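The sum in (4.16) overestimates the requirement whenever the subsequences reach their individual maxima at different time partitions. The correct technique is to find the maximum value over $n \in [0, N_V)$ of the sum of the number of live samples for the $M$ subsequences $x_j^m$. Therefore, to examine the total number of live samples at time partition $n \in [0, N_V)$, we use

$$r_{live,U}(n) = \sum_{m=0}^{M-1} r_{live,U_m}(n) \tag{4.17}$$

and take the maximum of this expression. Combining (4.17) with (4.14) results in

$$r_{live,U}(n) = \sum_{m=0}^{M-1} \left\{ \left\lceil \frac{n - (N_U m + u + P_U)}{N_V} \right\rceil - \left\lceil \frac{n - (N_U m + u + P_U + D^{D(max)}_{F,U_m})}{N_V} \right\rceil \right\}. \tag{4.18}$$

The minimum number of registers required to store the output samples of node $U$ is the maximum of $r_{live,U}(n)$ over the interval $[0, N_V)$, given by

$$r^{(max)}_{live,U} = \max_{n \in [0, N_V)} \left\{ r_{live,U}(n) \right\}. \tag{4.19}$$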


We  now  summarize  the  technique  for  determining  the  minimum  number  of  registers 
required  to  implement  the  output  of  a  Type  D  node. 


1. Partition the outgoing edges of node $U$ into $M$ sets $\mathcal{E}_{U_m}$, where an edge $e \in \mathcal{E}_{U_m}$ has $i_1$ delays between $U$ and the decimator on $e$, and $M\lceil i_1/M \rceil - i_1 = m$ holds.

2. Compute the quantity $D^{D(max)}_{F,U_m}$ in (4.13) for $m \in [0, M)$.

3. Compute the minimum number of registers using (4.18) and (4.19).
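
The three steps above can be sketched as follows. The function name and the edge representation (a pair of the delay count $i_1$ and the folded delay) are assumptions made for this illustration. A subsequence with no consumers is treated as having zero lifetime, which is also an assumption. The function returns both the correct value from (4.18) and (4.19) and the overestimate from (4.16) so that the two can be compared.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def type_d_registers(n_u, m, u, p_u, edges):
    """Register count for a Type D node.

    n_u   : iteration period N_U of node U
    m     : decimation factor M (so N_V = M * N_U)
    u     : folding order of U
    p_u   : pipeline stages P_U
    edges : list of (i1, d_f) pairs, one per outgoing edge, where i1 is the
            number of delays before the decimator and d_f the folded delay
    Returns (correct_minimum, naive_sum).
    """
    n_v = m * n_u

    # Steps 1 and 2: group edges by subsequence index and keep the largest delay.
    d_max = [0] * m  # assumption: an unused subsequence needs no storage
    for i1, d_f in edges:
        sub = m * ceil_div(i1, m) - i1
        d_max[sub] = max(d_max[sub], d_f)

    # Step 3: maximize the summed live-sample count over one period of V.
    def live(n):
        total = 0
        for k in range(m):
            p0 = n_u * k + u + p_u
            total += ceil_div(n - p0, n_v) - ceil_div(n - (p0 + d_max[k]), n_v)
        return total

    correct = max(live(n) for n in range(n_v))
    naive = sum(ceil_div(d, n_v) for d in d_max)
    return correct, naive


# Values of Example 4.3: N_U = 2, M = 3, u = 1, P_U = 1,
# edges (i1, D_F) = (2, 9), (1, 2), (3, 8), (0, 15).
print(type_d_registers(2, 3, 1, 1, [(2, 9), (1, 2), (3, 8), (0, 15)]))  # (5, 6)
```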


Example 4.3 In this example we compute the memory requirements for the Type D node in Figure 4.11. The iteration periods of the nodes are $N_U = 2$ and $N_{V_i} = 6$ for $i = 0, 1, 2, 3$. The folding orders are $u = 1$, $v_0 = 1$, $v_1 = 2$, $v_2 = 4$, and $v_3 = 5$. Node $U$ is assigned to a processor which is pipelined by one stage, i.e., $P_U = 1$. Let $e_i$ be the label of the edge from node $U$ to node $V_i$, i.e., the four edges of the DFG are $U \xrightarrow{e_i} V_i$ for $i = 0, 1, 2, 3$. Recall that $\mathcal{E}_{U_m}$ is the set of edges which connect node $U$ to nodes which consume samples $x(Ml + m)$, $m \in [0, M)$, where $x(n)$ is the output of node $U$, so $\mathcal{E}_{U_0} = \{e_2, e_3\}$, $\mathcal{E}_{U_1} = \{e_0\}$, and $\mathcal{E}_{U_2} = \{e_1\}$. The folding equations are

$$D_F(U \to V_0) = 2(3(1) + 2) - 1 + 1 - 1 = 9$$
$$D_F(U \to V_1) = 2(3(0) + 1) - 1 + 2 - 1 = 2$$
$$D_F(U \to V_2) = 2(3(0) + 3) - 1 + 4 - 1 = 8$$
$$D_F(U \to V_3) = 2(3(2) + 0) - 1 + 5 - 1 = 15,$$

and the values of $D^{D(max)}_{F,U_m}$ are as shown in Table 4.1.

Table 4.1: Values of $D^{D(max)}_{F,U_m}$ for Example 4.3.

    m    D^{D(max)}_{F,U_m}
    0    15
    1    9
    2    2

Figure 4.11: A Type D node $U$ with several fanout arcs.


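The correct way to compute the minimum number of registers is to use (4.19), which for this example is

$$\begin{aligned}
r^{(max)}_{live,U} &= \max_{n \in [0, 6)} \left\{ \sum_{m=0}^{2} \left( \left\lceil \frac{n - (2m + 1 + 1)}{6} \right\rceil - \left\lceil \frac{n - (2m + 1 + 1 + D^{D(max)}_{F,U_m})}{6} \right\rceil \right) \right\} \\
&= \max_{n \in [0, 6)} \left\{ \left\lceil \frac{n - 2}{6} \right\rceil - \left\lceil \frac{n - 17}{6} \right\rceil + \left\lceil \frac{n - 4}{6} \right\rceil - \left\lceil \frac{n - 13}{6} \right\rceil + \left\lceil \frac{n - 6}{6} \right\rceil - \left\lceil \frac{n - 8}{6} \right\rceil \right\} \\
&= \max\{4, 5, 4, 4, 4, 5\} = 5.
\end{aligned}$$

To see that (4.16) does not compute the minimum number of registers, note that (4.16) gives

$$\sum_{m=0}^{2} \left\lceil \frac{D^{D(max)}_{F,U_m}}{N_V} \right\rceil = 3 + 2 + 1 = 6,$$

which is one larger than the minimum number of required registers.

The lifetime chart [51] which verifies that 5 registers are required is shown in Figure 4.12.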


4.5.4  Memory  requirements  for  a  general  DFG 

Consider a DFG, where a node in the DFG can be a Type S, Type D, or Type E node. Let $\mathcal{U}$ denote the set of nodes in the DFG which are Type S, Type D, or Type E nodes. Based on the derivations of this section, we can write the expression for the number of live samples in the folded architecture for time unit $n$ as

$$r_{live}(n) = \sum_{U \in \mathcal{U}} r_{live,U}(n), \tag{4.20}$$

where the expressions for $r_{live,U}(n)$ are summarized in Table 4.2. The minimum number of registers required to implement this architecture is the maximum value of $r_{live}(n)$ over the interval $[0, N_{lcm})$, where $N_{lcm}$ is the least common multiple of the denominators of all of the ceiling functions in (4.20). These concepts are now demonstrated in the following example. This example is intended to demonstrate the memory minimization techniques for multirate folding that are introduced in this section. Examples which demonstrate how to use multirate folding to synthesize useful architectures, such as those for M-ary tree structured filter banks, are given in Section 4.6 and in [36].

Figure 4.12: The lifetime chart for Example 4.3. The folded implementation requires 5 registers since this is the maximum number of live samples at any time step.

Table 4.2: Summary of the expressions for $r_{live,U}(n)$ for the various types of nodes. Note that $u$ is the folding order of node $U$, and $P_U$ is the number of pipelining stages in hardware unit $H_U$ which executes node $U$.

    Node Type    Expression for r_{live,U}(n)

    S            $\left\lceil \frac{n - (u + P_U)}{N} \right\rceil - \left\lceil \frac{n - (u + P_U + D^{S(max)}_{F,U})}{N} \right\rceil$

    D            $\sum_{m=0}^{M-1} \left\{ \left\lceil \frac{n - (N_U m + u + P_U)}{N_V} \right\rceil - \left\lceil \frac{n - (N_U m + u + P_U + D^{D(max)}_{F,U_m})}{N_V} \right\rceil \right\}$

    E            $\left\lceil \frac{n - (u + P_U)}{N_U} \right\rceil - \left\lceil \frac{n - (u + P_U + D^{E(max)}_{F,U})}{N_U} \right\rceil$

Figure 4.13: Multirate DFG for Example 4.4.

Example 4.4 Consider the multirate DFG in Figure 4.13. In this figure, $A$ is a Type D node, $B$ and $C$ are Type S nodes, $D$ and $E$ are Type E nodes, and $F$ is a sink node. The iteration periods for the nodes are $N_A = N_F = 1$ and $N_B = N_C = N_D = N_E = 2$. The folding orders are $a = 0$, $b = 1$, $c = 0$, $d = 0$, $e = 1$, and $f = 0$. Each node is executed in hardware by a processor which is pipelined by one stage, so $P_A = P_B = P_C = P_D = P_E = P_F = 1$. In the architecture, nodes $B$ and $C$ are time multiplexed to the same processor, and nodes $D$ and $E$ are time multiplexed to the same processor. Based on these parameters, the folding equations are

$$D_F(A \to B) = 1(2(1) + 0) - 1 + 1 - 0 = 2$$
$$D_F(A \to C) = 1(2(1) + 1) - 1 + 0 - 0 = 2$$
$$D_F(B \to D) = 2(1) - 1 + 0 - 1 = 0$$
$$D_F(B \to E) = 2(1) - 1 + 1 - 1 = 1$$
$$D_F(C \to D) = 2(1) - 1 + 0 - 0 = 1$$
$$D_F(C \to E) = 2(1) - 1 + 1 - 0 = 2$$
$$D_F(D \to F) = 1(2(1) + 1) - 1 + 0 - 0 = 2$$
$$D_F(E \to F) = 1(2(1) + 0) - 1 + 0 - 1 = 0.$$


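The maximum fanout values are $D^{D(max)}_{F,A_0} = 2$, $D^{D(max)}_{F,A_1} = 2$, $D^{S(max)}_{F,B} = 1$, $D^{S(max)}_{F,C} = 2$, $D^{E(max)}_{F,D} = 2$, and $D^{E(max)}_{F,E} = 0$. Recall that node $A$ has two maximum fanout values (for $m = 0$ and $m = 1$) because it is a Type D node with decimation by $M = 2$ on each of its output arcs.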

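The number of live samples at time partition $n$ is given by

$$r_{live}(n) = \sum_{U \in \{A, B, C, D, E\}} r_{live,U}(n),$$

which is

$$\begin{aligned}
r_{live}(n) ={} & \left\lceil \frac{n - 1}{2} \right\rceil - \left\lceil \frac{n - 3}{2} \right\rceil + \left\lceil \frac{n - 2}{2} \right\rceil - \left\lceil \frac{n - 4}{2} \right\rceil + \left\lceil \frac{n - 2}{2} \right\rceil - \left\lceil \frac{n - 3}{2} \right\rceil \\
& + \left\lceil \frac{n - 1}{2} \right\rceil - \left\lceil \frac{n - 3}{2} \right\rceil + \left\lceil \frac{n - 1}{2} \right\rceil - \left\lceil \frac{n - 3}{2} \right\rceil + \left\lceil \frac{n - 2}{2} \right\rceil - \left\lceil \frac{n - 2}{2} \right\rceil,
\end{aligned}$$

where the first two terms are for $A_0$, the next two for $A_1$, followed by two terms each for nodes $B$, $C$, $D$, and $E$. The minimum number of registers required for the architecture is

$$r^{(max)}_{live} = \max_{n \in \{0, 1\}} \left\{ r_{live}(n) \right\} = \max\{4, 5\} = 5.$$

One implementation which uses 5 registers is shown in Figure 4.14, where processor $P_1$ executes node $A$, processor $P_2$ executes nodes $B$ and $C$, processor $P_3$ executes nodes $D$ and $E$, and processor $P_4$ executes node $F$.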
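
The whole-DFG computation in (4.20) can be sketched by describing every node (or every subsequence of a Type D node) as one pair of ceiling terms and summing them over one least-common-multiple period. The tuple layout and function names below are assumptions made for this illustration; the numeric values are those derived in Example 4.4.

```python
from functools import reduce
from math import gcd


def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def total_live(n, terms):
    """Sum of live samples at time partition n over all terms.

    Each term is (p0, lifetime, period): the first production time, the
    largest folded delay, and the denominator of the ceiling functions.
    """
    return sum(
        ceil_div(n - p0, period) - ceil_div(n - (p0 + lifetime), period)
        for p0, lifetime, period in terms
    )


def min_registers(terms):
    """Return the per-partition live counts and their maximum."""
    n_lcm = reduce(lambda a, b: a * b // gcd(a, b), (t[2] for t in terms))
    per_partition = [total_live(n, terms) for n in range(n_lcm)]
    return per_partition, max(per_partition)


# Example 4.4 (all pipeline depths are 1):
terms = [
    (1, 2, 2),  # A_0: p0 = N_A*0 + a + P_A, D = 2, N_V = 2
    (2, 2, 2),  # A_1: p0 = N_A*1 + a + P_A, D = 2, N_V = 2
    (2, 1, 2),  # B:   p0 = b + P_B,         D = 1, N_B = 2
    (1, 2, 2),  # C:   p0 = c + P_C,         D = 2, N_C = 2
    (1, 2, 2),  # D:   p0 = d + P_D,         D = 2, N_D = 2
    (2, 0, 2),  # E:   p0 = e + P_E,         D = 0, N_E = 2
]
per_partition, registers = min_registers(terms)
print(per_partition)  # [4, 5]
print(registers)      # 5
```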


4.6  Design  Example 


Figure 4.14: Folded architecture for Example 4.4. $D$ denotes an internal pipelining delay, while $R_i$ denote external registers. This implementation uses five registers, which is the minimum value computed in the example.

In this section we give an example which illustrates how the folding equations, retiming for folding constraints, and memory minimization can be used to synthesize a single-rate architecture for a multirate DSP algorithm. The algorithm we consider is the three-level orthogonal discrete wavelet transform analysis filter bank which uses third-order wavelet filters, as shown in Figure 4.15 [5]. The schedule for the architecture is given in Table 4.3.
The  steps  we  take  in  deriving  the  folded  architecture  are  as  follows: 


1.  Write  the  folding  equations  for  the  DFG. 

2.  Write  the  retiming-for-folding  constraints  and  find  a  solution. 

3.  Write  the  folding  equations  for  the  retimed  DFG. 

4.  Determine  the  memory  requirements  for  the  folded  architecture. 

5.  Allocate  data  to  the  minimum  number  of  registers. 

6.  Draw  the  folded  architecture. 


Each  of  these  steps  is  described  in  detail  in  the  following  subsections. 
Figure 4.15: A three-level orthogonal discrete wavelet transform analysis filter bank which uses third-order wavelet filters.

4.6.1  Folding  Equations  for  the  Original  DFG 

The  multirate  DFG  in  Figure  4.15  has  36  single-rate  edges  and  6  multirate  edges  which 
contain  decimators.  The  number  of  folded  delays  on  each  edge  prior  to  retiming  is  given 
in  Table  4.4  for  the  single-rate  edges  and  in  Table  4.5  for  the  multirate  edges.  These 
values  are  computed  by  using  the  number  of  delays  on  the  edges  in  the  DFG  and  the 
schedule  in  Table  4.3  and  plugging  these  values  into  (4.1)  and  (4.2). 
Table 4.3: Schedule for the three-level orthonormal DWT example. The numbers across the top of the table represent the eight time partitions. An X denotes a null operation.

                    0      1      2      3      4      5      6      7
    Processor M1    M10    M11    M10    M12    M10    M11    M10    X
    Processor M2    M20    M21    M20    M22    M20    M21    M20    X
    Processor M3    M30    X      M30    M31    M30    M32    M30    M31
    Processor M4    M40    X      M40    M41    M40    M42    M40    M41
    Processor M5    M52    M50    M51    M50    X      M50    M51    M50
    Processor M6    M62    M60    M61    M60    X      M60    M61    M60
    Processor A1    A10    A11    A10    X      A10    A11    A10    A12
    Processor A2    A20    A21    A20    X      A20    A21    A20    A22
    Processor A3    A31    A30    A32    A30    A31    A30    X      A30
    Processor A4    A41    A40    A42    A40    A41    A40    X      A40

4.6.2 Retiming for Folding


There are 36 retiming-for-folding constraints for the single-rate edges and 6 for the
multirate edges. These are given in Table 4.4 for the single-rate edges and in Table 4.5
for the multirate edges. The retiming-for-folding constraints used are (4.4) and (4.5).
The columns labeled R_UV give the values of the right-hand sides of the inequalities for
each edge. Note that we also impose the constraint r(IN) = 0. This constraint avoids
adding new delays at the input, which could change the functionality of the circuit, as
described in Section 4.4. The columns labeled r(U) and r(V) in Tables 4.4 and 4.5 give a
solution to these inequalities.
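
The tabulated solution can be cross-checked numerically. The sketch below is not a restatement of (4.4) and (4.5); it assumes a relation inferred from the values in Tables 4.4 and 4.5. If nodes U and V are executed once every N_U and N_V time units (1 for IN, and 2, 4, and 8 for the level-0, level-1, and level-2 nodes, as read off the schedule in Table 4.3), the folded delay after retiming is assumed to be the folded delay before retiming plus N_V r(V) - N_U r(U), and R_UV is assumed to be floor(before / N_U). The function names are hypothetical.

```python
# Rows copied from Tables 4.4 and 4.5:
# (U, V, N_U, N_V, delays before, r(U), r(V), R_UV, delays after).
# N_U and N_V are execution periods read off Table 4.3 (an assumption of this sketch).
ROWS = [
    ("M10", "A10", 2, 2, -2, 0, 2, -1, 2),
    ("A20", "A40", 2, 2,  2, 2, 3,  1, 4),
    ("M31", "A21", 4, 4, -4, 2, 3, -1, 0),
    ("A21", "A41", 4, 4,  2, 3, 4,  0, 6),
    ("A12", "A32", 8, 8, -6, 2, 3, -1, 2),
    ("A22", "A42", 8, 8,  2, 2, 3,  0, 10),
    ("IN",  "M20", 1, 2,  1, 0, 0,  1, 1),
    ("A30", "M11", 2, 4, -1, 3, 2, -1, 1),
    ("A31", "M22", 4, 8,  6, 4, 2,  1, 6),
]


def folded_after(before, n_u, n_v, r_u, r_v):
    """Assumed relation: retiming V adds n_v * r_v folded delays, retiming U removes n_u * r_u."""
    return before + n_v * r_v - n_u * r_u


def rhs_bound(before, n_u):
    """Assumed form of the right-hand side R_UV: floor(before / n_u)."""
    return before // n_u


def check_rows(rows):
    """Return (U, V, ok) for each row; ok means the row matches both assumed relations."""
    results = []
    for u, v, n_u, n_v, before, r_u, r_v, r_uv, after in rows:
        ok = (
            folded_after(before, n_u, n_v, r_u, r_v) == after
            and rhs_bound(before, n_u) == r_uv
            and after >= 0  # a folded edge needs a nonnegative number of delays
        )
        results.append((u, v, ok))
    return results


if __name__ == "__main__":
    for u, v, ok in check_rows(ROWS):
        print(f"{u} -> {v}: {'consistent' if ok else 'MISMATCH'}")
```

Every row listed here satisfies both assumed relations, and every folded delay after retiming is nonnegative, which is the property the retiming-for-folding constraints are meant to guarantee.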

4.6.3  Folding  Equations  for  the  Retimed  DFG 


Based on the retiming values for the nodes, folding equations can be written for the
retimed graph. Because the retiming solutions satisfy all of the retiming-for-folding
constraints, the folding equations now result in a nonnegative number of delays for each
folded edge. The new folding equations are given in Table 4.4 for the single-rate edges
and in Table 4.5 for the multirate edges.

4.6.4  Memory  Requirements  of  the  Folded  Architecture 

The  memory  in  the  folded  architecture  can  be  found  using  (4.20).  Since  the  architecture 
implements  the  retimed  DFG,  the  number  of  delays  on  the  folded  edges  for  the  retimed 
graph  are  used  in  the  expressions  in  Table  4.2.  An  important  point  is  that  an  edge  with 
a  decimator  can  change  from  set  £lJ^  to  Eu-  as  a  result  of  retiming,  and  this  change 
must  be  taken  into  account  to  get  an  accurate  evaluation  of  the  memory  required  by  the 
folded  architecture.  Taking  this  into  account,  the  minimum  number  of  registers  required 
to  implement  the  folded  architecture  is  14. 

4.6.5  Allocate  Data  to  the  Minimum  Number  of  Registers 

To keep routing simple, we attempted to localize data within the architecture while still
using only 14 registers. For example, we were able to allow only the output samples of
multiplier M1 to occupy registers R1 and R2 (see Figure 4.16), which avoids routing the
outputs of other processors to these two registers. Allocation techniques proposed in [51]
were used to allocate the data to the 14 registers.

4.6.6  The  Folded  Architecture 

The  folded  architecture  is  shown  in  Figure  4.16.  This  architecture  uses  the  theoretical 
lower  limit  of  14  registers.  Delays  denoted  as  D  are  internal  pipelining  delays,  while 
the  14  external  registers  are  labeled  Ri.  The  fact  that  this  architecture  has  the  same 
functionality  as  the  DFG  shown  in  Figure  4.15  has  been  verified  by  simulation  using 
Matlab  Simulink. 

This is not the only architecture which can be designed for this algorithm using
multirate folding. Using the systematic multirate folding technique proposed in this
chapter, we have also designed a different architecture, which uses only three
multipliers and two adders, for the three-level orthogonal discrete wavelet transform
with seventh-order FIR filters; that example is omitted to save space. This demonstrates
that multirate folding can be used to design a broad class of single-rate architectures
for multirate DSP applications.


Figure  4.16:  Folded  architecture  for  the  three-level  orthogonal  discrete  wavelet  transform 
analysis  filter  bank  which  uses  third-order  wavelet  filters.  If  an  input  to  a  switch  is  not 
labeled,  then  this  input  is  switched  in  at  all  time  units  not  assigned  to  other  inputs  of 
the  switch. 


Table 4.4: Folding and retiming equations for the single-rate edges in the DWT example.

            Folded delays                          Folded delays
U     V     before retiming   r(U)   r(V)   R_UV   after retiming
M10   A10        -2            0      2      -1         2
M10   M30        -2            0      1      -1         0
M20   A20        -2            0      2      -1         2
M20   M40        -2            0      1      -1         0
M30   A20        -2            1      2      -1         0
M40   A10        -2            1      2      -1         0
A10   A30         0            2      3       0         2
A10   M50         0            2      2       0         0
A20   A40         2            2      3       1         4
A20   M60         2            2      2       1         2
M50   A40        -2            2      3      -1         0
M60   A30        -2            2      3      -1         0
M11   A11        -2            2      3      -1         2
M11   M31         0            2      2       0         0
M21   A21        -2            2      3      -1         2
M21   M41         0            2      2       0         0
M31   A21        -4            2      3      -1         0
M41   A11        -4            2      3      -1         0
A11   A31        -2            3      4      -1         2
A11   M51         0            3      3       0         0
A21   A41         2            3      4       0         6
A21   M61         4            3      3       1         4
M51   A41        -4            3      4      -1         0
M61   A31        -4            3      4      -1         0
M12   A12         2            2      2       0         2
M12   M32         0            2      2       0         0
M22   A22         2            2      2       0         2
M22   M42         0            2      2       0         0
M32   A22         0            2      2       0         0
M42   A12         0            2      2       0         0
A12   A32        -6            2      3      -1         2
A12   M52        -8            2      3      -1         0
A22   A42         2            2      3       0        10
A22   M62         0            2      3       0         8
M52   A42         0            3      3       0         0
M62   A32         0            3      3       0         0

Table 4.5: Folding and retiming equations for the multirate edges in the DWT example.

            Folded delays                          Folded delays
U     V     before retiming   r(U)   r(V)   R_UV   after retiming
IN    M10         0            0      0       0         0
IN    M20         1            0      0       1         1
A30   M11        -1            3      2      -1         1
A30   M21         1            3      2       0         3
A31   M12         2            4      2       0         2
A31   M22         6            4      2       1         6

4.7 Conclusions

A novel multirate folding transformation has been developed for mapping multirate
DSP algorithms to single-rate VLSI architectures. This transformation can be used
to synthesize architectures for a wide range of DSP applications which use multirate
algorithms, such as signal coding and analysis and adaptive signal processing.

Multirate folding equations were derived for arcs which contain decimators or
expanders. In both cases, the folding equation contains single-rate folding as a special
case. These folding equations were then used to solve two important related problems,
namely, memory minimization in folded architectures and retiming for folding. By
deriving the multirate folding equations and solving these related problems, we have
formalized several crucial steps used in mapping multirate DSP algorithms to efficient
VLSI architectures.

A detailed design example of a three-level discrete wavelet transform analysis filter
bank was given. This example demonstrated how the multirate folding equations, along
with retiming for folding and memory minimization, can be used to design single-rate
architectures for multirate algorithms. Multirate folding can be used to design
architectures for a wide variety of filter banks as described in [36].




Chapter  5 

Two-Dimensional  Retiming 

5.1  Introduction 

Retiming [27] is a technique used to move delay elements around in a circuit without
changing its functionality. One effect of changing the locations of the delays is that
combinational rippling can be reduced, allowing the circuit to be clocked at a higher
rate. Reducing combinational rippling also decreases the dynamic power dissipation in
the circuit [48] and allows the circuit to be operated with a lower supply voltage, both of
which lead to low-power implementations [67]. Another effect of changing the locations
of delays is that the number of delay elements required can be reduced, resulting in
area-efficient implementations. In addition to retiming for high-speed, low-power, and
low-area implementations, retiming is also an important step in scheduling for high-level
synthesis [11]-[38]. All of these applications of retiming have been studied for circuits
which operate on one-dimensional signals, such as digital audio.

Two-dimensional retiming [33, 34] is used to retime data-flow graphs (DFGs) which
operate on two-dimensional signals such as images. As digital image processing becomes
more popular in multimedia applications, the need for high-speed, low-area, and
low-power implementations of multidimensional digital signal processing (DSP) algorithms
increases. Like one-dimensional retiming [27], two-dimensional retiming can be used
to increase the sample rate, reduce the area, and reduce the power consumed by a
synchronous circuit.

Techniques for reducing the execution times of 2-D DSP algorithms have been considered
in the past. One way to speed up these algorithms is to process many iterations
concurrently, and it has been shown that this is often possible if the 2-D data are not
processed in line-by-line or column-by-column order, but rather are processed diagonally
[69, 70]. This technique requires an increase in the number of arithmetic units. Another
way to speed up these algorithms is to reduce the sample period using 2-D retiming
techniques [33, 34]. This technique does not require an increase in the number of
arithmetic units; however, as we show in this chapter, the algorithm for 2-D retiming in
[34] often results in an implementation which requires significantly more memory than is
actually needed. Since the area consumed by the implementation of a 2-D DSP algorithm can
be dominated by memory requirements [71], it is important to keep the memory
requirements as small as possible. The algorithm for 2-D retiming in [33] is not very
flexible because it is compatible only with some very specific processing orders of the
data.

In this chapter, we present two techniques for retiming two-dimensional data-flow
graphs (2DFGs). Each of these techniques minimizes the amount of memory required to
implement the 2DFG under a clock period constraint. The first technique, called ILP
2-D retiming, is based on an integer linear programming (ILP) formulation which
considers the 2-D retiming formulation as a whole. While this technique gives excellent
results, it converges slowly for large 2DFGs. The second technique, called orthogonal
2-D retiming, is formulated by breaking ILP 2-D retiming into two linear programming
problems, each of which can be solved in polynomial time. The drawback of orthogonal
2-D retiming is that the results of the two linear programming problems can sometimes
be incompatible. A variation of orthogonal 2-D retiming called integer orthogonal 2-D
retiming is also based on a linear programming formulation, and this technique solves the
incompatibility problem which may be encountered using orthogonal 2-D retiming. The
techniques presented in this chapter result in retimed 2DFGs which require less memory
than the technique in [34] and are compatible with considerably more processing
orders of the data than the technique described in [33].

This chapter is organized as follows. Section 5.2 describes some specifics of
two-dimensional data processing. Section 5.3 contains the ILP 2-D retiming formulation.
Orthogonal 2-D retiming and integer orthogonal 2-D retiming are presented in
Sections 5.4 and 5.5, respectively. Comparisons with previous work are given in
Section 5.6 and our conclusions are in Section 5.7.

5.2  Processing  Two-Dimensional  Data  Sets 

A two-dimensional DSP algorithm can be represented using a two-dimensional data-flow
graph (2DFG). A 2DFG $G = \langle V, E, \mathbf{w}, d \rangle$ is a node-weighted and
edge-weighted graph such that

• $V$ is the set of vertices (nodes) in $G$. The nodes represent computations.

• $E$ is the set of edges in $G$. The edges represent communication between the nodes.

• $\mathbf{w}(e)$ is a $2 \times 1$ vector representing the dependency on edge $e$.

• $d(v)$ is a nonnegative scalar representing the computation time of node $v$.

As an example, the 2DFG in Figure 5.1 describes the computation
$y(n_1, n_2) = b + a\,x(n_1 + 1, n_2 - 1)$. An iteration is the execution of each node in
the 2DFG exactly once.

Figure 5.1: A 2DFG which describes the computation $y(n_1, n_2) = b + a\,x(n_1 + 1, n_2 - 1)$.

5.2.1  Overview  of  Two-Dimensional  Retiming 

The 1-D retiming equation given in [27] for the edge $u \xrightarrow{e} v$ in a 1-D DFG is
given by

$$w_r(e) = w(e) + r(v) - r(u),$$

where $w(e)$ and $w_r(e)$ are the numbers of delays on $e$ before and after retiming,
respectively, and $r(u)$ and $r(v)$ are the retiming values of nodes $u$ and $v$,
respectively. The 2-D retiming equation for the edge $u \xrightarrow{e} v$ in a 2DFG is
given by

$$\mathbf{w}_r(e) = \mathbf{w}(e) + \mathbf{r}(v) - \mathbf{r}(u), \tag{5.1}$$

where $\mathbf{w}(e)$ and $\mathbf{w}_r(e)$ are the $2 \times 1$ dependence vectors on $e$
before and after retiming, respectively, and $\mathbf{r}(u)$ and $\mathbf{r}(v)$ are the
$2 \times 1$ retiming vectors of nodes $u$ and $v$, respectively.

A 1-D retiming $r$ is said to be legal if $w_r(e) \ge 0$ for all $e \in E$. The conditions
for a legal 2-D retiming are derived in Section 5.3.1.
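
As a small illustration of (5.1), the sketch below applies a retiming to a three-node cycle. The graph, its dependence vectors, and the retiming values are hypothetical and are not taken from any figure in this chapter. The example also shows a standard consequence of the retiming equation: the retiming vectors cancel around a cycle, so the total dependence of the cycle is unchanged.

```python
def retime_edges(edges, r):
    """Apply the 2-D retiming equation (5.1) to every edge (u, v, w)."""
    retimed = []
    for u, v, w in edges:
        w_r = (w[0] + r[v][0] - r[u][0], w[1] + r[v][1] - r[u][1])
        retimed.append((u, v, w_r))
    return retimed


def total_dependence(edges):
    """Component-wise sum of the dependence vectors over a list of edges."""
    return (sum(w[0] for _, _, w in edges), sum(w[1] for _, _, w in edges))


# Hypothetical three-node cycle A -> B -> C -> A (not a graph from the text).
EDGES = [("A", "B", (0, 0)), ("B", "C", (1, 0)), ("C", "A", (0, 1))]
R = {"A": (0, 0), "B": (1, 0), "C": (0, 0)}

RETIMED = retime_edges(EDGES, R)

if __name__ == "__main__":
    for u, v, w_r in RETIMED:
        print(f"{u} -> {v}: w_r = {w_r}")
    print("cycle total before:", total_dependence(EDGES))
    print("cycle total after: ", total_dependence(RETIMED))
```

Here the retiming moves the dependence $[1\ 0]^T$ from edge B -> C to edge A -> B, while the sum around the cycle stays at $[1\ 1]^T$.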

5.2.2  Types  of  Parallelism  Available  in  2-D  Signal  Processing 

There are two types of parallelism available in 2-D signal processing. The first type of
parallelism is inter-iteration parallelism, which can be achieved by increasing the amount
of hardware so that multiple iterations can be executed concurrently. For example,
consider the 2DFG in Figure 5.2(a) which implements
$y(n_1, n_2) = a\,y(n_1 - 1, n_2) + b\,y(n_1, n_2 - 1) + x(n_1, n_2)$. Assume that this 2DFG
is used to process a 3 x 3 data set. The output values $y(n_1, n_2)$ are dependent on one
another as shown in Figure 5.2(b), where, e.g., the arrow from y(1,0) to y(1,1) indicates
that y(1,0) must be computed before y(1,1) can be computed. Four possible execution
orders are given in Table 5.1.

Table 5.1: Four possible execution orders for the DFG in Figure 5.2(a) assuming a 3 x 3
data set.

          row-by-row   column-by-column   diagonal   diagonal
          serial       serial             serial     parallel
Step 1    y(0,0)       y(0,0)             y(0,0)     y(0,0)
Step 2    y(1,0)       y(0,1)             y(1,0)     y(0,1), y(1,0)
Step 3    y(2,0)       y(0,2)             y(0,1)     y(0,2), y(1,1), y(2,0)
Step 4    y(0,1)       y(1,0)             y(2,0)     y(1,2), y(2,1)
Step 5    y(1,1)       y(1,1)             y(1,1)     y(2,2)
Step 6    y(2,1)       y(1,2)             y(0,2)     -
Step 7    y(0,2)       y(2,0)             y(2,1)     -
Step 8    y(1,2)       y(2,1)             y(1,2)     -
Step 9    y(2,2)       y(2,2)             y(2,2)     -

Figure 5.2: (a) A 2DFG which describes the computation
$y(n_1, n_2) = a\,y(n_1 - 1, n_2) + b\,y(n_1, n_2 - 1) + x(n_1, n_2)$. (b) The dependencies
for this 2DFG assuming it operates on a 3 x 3 data set.


While the three serial execution orders require a single hardware module and nine time
steps to execute, the parallel execution order requires three hardware modules and only
five time steps, where a hardware module is capable of executing one iteration in
one time step. The parallel execution order uses inter-iteration parallelism to speed up
the execution of the 2-D signal processing algorithm.
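
The four execution orders of Table 5.1 can be generated and checked against the dependencies of Figure 5.2(b). The following is a minimal sketch; the function names are hypothetical, and each schedule is represented as a list of steps, where a step is the list of samples computed in that time step.

```python
N = 3  # the data set is N x N


def depends_on(n1, n2):
    """Samples that y(n1, n2) needs: y(n1-1, n2) and y(n1, n2-1), when they exist."""
    deps = []
    if n1 > 0:
        deps.append((n1 - 1, n2))
    if n2 > 0:
        deps.append((n1, n2 - 1))
    return deps


def row_by_row():
    return [[(n1, n2)] for n2 in range(N) for n1 in range(N)]


def column_by_column():
    return [[(n1, n2)] for n1 in range(N) for n2 in range(N)]


def diagonal_serial():
    # Visit anti-diagonals n1 + n2 = k in order, one sample per step, n1 decreasing.
    return [[(n1, k - n1)]
            for k in range(2 * N - 1)
            for n1 in range(min(k, N - 1), max(0, k - N + 1) - 1, -1)]


def diagonal_parallel():
    # All samples on the anti-diagonal n1 + n2 = k are computed in the same step.
    return [[(n1, k - n1) for n1 in range(max(0, k - N + 1), min(k, N - 1) + 1)]
            for k in range(2 * N - 1)]


def is_valid(schedule):
    """Each sample must be computed in a strictly later step than its dependencies."""
    step_of = {s: t for t, step in enumerate(schedule) for s in step}
    return all(step_of[d] < step_of[s] for s in step_of for d in depends_on(*s))


def modules_needed(schedule):
    """Hardware modules required: the largest number of samples in any one step."""
    return max(len(step) for step in schedule)


if __name__ == "__main__":
    for name, make in [("row-by-row", row_by_row), ("column-by-column", column_by_column),
                       ("diagonal serial", diagonal_serial),
                       ("diagonal parallel", diagonal_parallel)]:
        sched = make()
        print(name, len(sched), "steps,", modules_needed(sched), "module(s), valid:",
              is_valid(sched))
```

The serial schedules take nine steps on one module, and the diagonal parallel schedule takes five steps on three modules, in agreement with the counts stated above.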

The second type of parallelism is inter-operation parallelism. This involves retiming
the 2DFG so operations can be executed in parallel, resulting in a shorter clock period.
For the 2DFG in Figure 5.2(a), assume addition and multiplication require 1 and 2 time
units, respectively. The minimum clock period for this 2DFG is 4 time units because
there is a path through two adders and one multiplier (e.g., through nodes 4, 2, and
1) which has no delays. As a result, the time required to process the 3 x 3 data set
using a serial processing order is (4)(9) = 36 time units. The 2DFG in Figure 5.2(a)
can be retimed as shown in Figure 5.3 assuming $\mathbf{r}(1) = [0\ 0]^T$,
$\mathbf{r}(2) = [0\ 0]^T$, $\mathbf{r}(3) = [-2\ 1]^T$, and $\mathbf{r}(4) = [-1\ 0]^T$.
This retimed 2DFG has a minimum clock period of 2 time units because the longest path
with no delays is through a multiplier or two adders. The time required to process the
3 x 3 data set using the diagonal serial processing order is now (2)(9) = 18 time units,
so 2-D retiming has allowed us to speed up the processing by a factor of 2.

2-D retiming allows the circuit to be clocked faster because operations in the retimed
circuit can be executed in parallel. Table 5.2 shows some possible execution times for
the nodes in the unretimed 2DFG (Figure 5.2(a)) and the retimed 2DFG (Figure 5.3). Since
multiplication and addition in the retimed 2DFG can be performed in parallel rather than
sequentially, 2-D retiming allows for an implementation where operations are executed
in parallel, hence the name inter-operation parallelism. The remainder of this chapter
assumes that a 2-D data set is processed using a serial processing order, and we focus
on exploiting inter-operation parallelism using 2-D retiming.

Figure 5.3: A retimed version of the 2DFG in Figure 5.2(a).


Table 5.2: Possible execution times for the unretimed 2DFG in Figure 5.2(a) and the
retimed 2DFG in Figure 5.3 assuming that addition and multiplication require 1 and
2 units of time, respectively. The unretimed 2DFG does not allow addition and
multiplication to be executed in parallel, while the retimed 2DFG does allow addition and
multiplication to be executed in parallel.


5.2.3  Processing  Order 

A two-dimensional DSP algorithm can often be executed using several processing orders.
This was demonstrated in the previous section, where three serial processing orders were
given in Table 5.1 for the 2DFG in Figure 5.2(a). A linear processing order is specified
using a scanning vector $\mathbf{s} = [s_1\ s_2]^T$ and an access vector
$\mathbf{a} = [a_1\ a_2]^T$. Lines orthogonal to the scanning vector are called access
lines, and sample $(n_1, n_2)$ on access line $k$ satisfies $n_1 s_1 + n_2 s_2 = k$. The
processing order is such that, for $k_1 < k_2$, all samples on access line $k_1$ are
processed before the samples on access line $k_2$. The access vector, which is orthogonal
to the scanning vector ($\mathbf{s} \cdot \mathbf{a} = 0$), defines the order in which
samples are processed on the access lines, such that sample $\mathbf{n} + \mathbf{a}$ is
processed immediately following sample $\mathbf{n}$. Lines orthogonal to the access vector
are called scanning lines, and sample $(n_1, n_2)$ on scanning line $k$ satisfies
$n_1 a_1 + n_2 a_2 = k$. As an example, the processing order in Figure 5.4 is described by
$\mathbf{s} = [1\ 1]^T$ and $\mathbf{a} = [-1\ 1]^T$, and sample (2,4) is on access line 6
and scanning line 2. In addition to linear processing orders, nonlinear processing orders
such as the Dovetail scan [72] also exist; however, this chapter considers only linear
processing orders.
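
The definitions of access lines and scanning lines translate directly into a sort key. The sketch below uses hypothetical function names. It orders samples by access line first and, within an access line, by scanning line, because stepping from $\mathbf{n}$ to $\mathbf{n} + \mathbf{a}$ raises the scanning-line index by $\mathbf{a} \cdot \mathbf{a} > 0$.

```python
def access_line(n, s):
    """Index k of the access line through sample n: n1*s1 + n2*s2."""
    return n[0] * s[0] + n[1] * s[1]


def scanning_line(n, a):
    """Index k of the scanning line through sample n: n1*a1 + n2*a2."""
    return n[0] * a[0] + n[1] * a[1]


def processing_order(samples, s, a):
    """Sort samples by access line, then along the access vector within each line."""
    return sorted(samples, key=lambda n: (access_line(n, s), scanning_line(n, a)))


S = (1, 1)    # scanning vector from the example in the text
A = (-1, 1)   # access vector from the example in the text

if __name__ == "__main__":
    print(access_line((2, 4), S), scanning_line((2, 4), A))
    grid = [(n1, n2) for n1 in range(3) for n2 in range(3)]
    print(processing_order(grid, S, A))
```

For sample (2,4) this gives access line 6 and scanning line 2, as stated above. On a 3 x 3 data set, the resulting order coincides with the diagonal serial column of Table 5.1.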

5.3  An  Integer  Linear  Programming  Formulation  of  2-D 
Retiming 

In  this  section  we  formulate  the  ILP  2-D  retiming  technique  which  considers  causality, 
the  desired  clock  period,  and  the  memory  cost  of  the  2-D  retiming  solution. 

5.3.1  Causality  in  2-D  Data  Processing 

A dependency $w(e)$ in a 1-D DFG must represent a causal relationship. If the edge
$u \rightarrow v$ has a negative number of delays, this indicates that node $v$ is
consuming data before node $u$ has produced the data, and this is not practical from an
implementation point of view. Causality restricts the number of delays on an edge in a
1-D DFG to be nonnegative, which can be written as $w(e) \ge 0$ for all $e \in E$. The
expression $w(e) \ge 0$ for all $e \in E$ can be viewed as the condition for the
compatibility between the dependencies and the order in which the data is processed
(which is dictated by time).

In 2DFGs, where the processing order is specified by $\mathbf{s}$ and $\mathbf{a}$, there
are two conditions for the compatibility between the dependencies and the processing
order. These two conditions are the 2-D causality constraints. The first causality
constraint states that a dependency $\mathbf{w}(e)$ on the edge $u \xrightarrow{e} v$
cannot point from access line $k_2$ to access line $k_1$ for $k_1 < k_2$ because this would
indicate that the data produced when access line $k_2$ is processed is consumed when
access line $k_1$ is processed, and this violates causality because access line $k_2$ is
processed after access line $k_1$. Mathematically, this causality constraint can be
written as

Causality Constraint 5.1 For all $e \in E$, $\mathbf{s} \cdot \mathbf{w}(e) \ge 0$ must hold.

The second causality constraint states that if the dependency $\mathbf{w}(e)$ lies in the
same direction as the access lines, then the dependency cannot point in the direction
opposite to the access vector because this would mean that the dependency points against
the direction in which the data is processed. This can be expressed as

Causality Constraint 5.2 For all $e \in E$ such that $\mathbf{s} \cdot \mathbf{w}(e) = 0$,
$\mathbf{a} \cdot \mathbf{w}(e) \ge 0$ must hold.

Example 5.1 For $\mathbf{s} = [1\ 1]^T$ and $\mathbf{a} = [-1\ 1]^T$, Figure 5.4 shows how
four different dependencies would affect the sample at the (2,3) location. The dependency
$\mathbf{w}(e) = [0\ {-1}]^T$ represents a non-causal relationship because the value
computed when sample (2,4) is processed affects the value at sample (2,3), but sample
(2,4) is processed after (2,3). This dependency violates the first causality constraint
because $\mathbf{s} \cdot \mathbf{w}(e) = -1$. The dependency $\mathbf{w}(e) = [0\ 1]^T$
represents a causal relationship because the value computed when sample (2,2) is
processed affects the value at sample (2,3), and sample (2,2) is processed before (2,3).
This dependency satisfies the first causality constraint because
$\mathbf{s} \cdot \mathbf{w}(e) = 1$. The dependency $\mathbf{w}(e) = [1\ {-1}]^T$
represents a non-causal relationship because the value computed when sample (1,4) is
processed affects the value at sample (2,3), but sample (1,4) is processed after (2,3).
This dependency violates the second causality constraint because
$\mathbf{a} \cdot \mathbf{w}(e) = -2$ and $\mathbf{s} \cdot \mathbf{w}(e) = 0$. The
dependency $\mathbf{w}(e) = [-1\ 1]^T$ represents a causal relationship because the value
computed when sample (3,2) is processed affects the value at sample (2,3), and sample
(3,2) is processed before (2,3). This dependency satisfies the second causality
constraint because $\mathbf{a} \cdot \mathbf{w}(e) = 2$ and
$\mathbf{s} \cdot \mathbf{w}(e) = 0$. □
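
The four cases of Example 5.1 can be reproduced with a direct implementation of Causality Constraints 5.1 and 5.2. This is a minimal sketch with hypothetical function names; the vectors are the ones used in the example.

```python
def dot(x, y):
    return x[0] * y[0] + x[1] * y[1]


def is_causal(w, s, a):
    """Check Causality Constraints 5.1 and 5.2 for one dependence vector w."""
    sw = dot(s, w)
    if sw < 0:
        return False          # violates Constraint 5.1
    if sw == 0 and dot(a, w) < 0:
        return False          # violates Constraint 5.2
    return True


S = (1, 1)
A = (-1, 1)
DEPENDENCIES = [(0, -1), (0, 1), (1, -1), (-1, 1)]

if __name__ == "__main__":
    for w in DEPENDENCIES:
        print(w, "s.w =", dot(S, w), "a.w =", dot(A, w),
              "causal" if is_causal(w, S, A) else "non-causal")
```

The printed classification matches the example: the first and third dependencies are non-causal, and the second and fourth are causal.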

Figure 5.4: The effect of four dependencies on sample (2,3). Processing starts at sample
(0,0).

Let $H_{max}$ be the maximum number of samples on any access line. Then the length
of the longest access line is $(H_{max} - 1)(\mathbf{a} \cdot \mathbf{a})$. In a practical
situation, the length of each dependence vector is not greater than the length of the
longest access line, and this implies that the projection of a dependence vector onto the
access vector obeys

$$H_{max}(\mathbf{a} \cdot \mathbf{a}) > |\mathbf{a} \cdot \mathbf{w}(e)|. \tag{5.2}$$

This inequality is used in the following theorem to combine the two causality constraints
into a single constraint.

Theorem 5.1 Let (5.2) hold for all $e \in E$. Then

$$H_{max}(\mathbf{a} \cdot \mathbf{a})(\mathbf{s} \cdot \mathbf{w}(e)) + \mathbf{a} \cdot \mathbf{w}(e) \ge 0 \tag{5.3}$$

if and only if the following hold:

1. $\mathbf{s} \cdot \mathbf{w}(e) \ge 0$, and

2. $\mathbf{a} \cdot \mathbf{w}(e) \ge 0$ if $\mathbf{s} \cdot \mathbf{w}(e) = 0$.

Proof: In the first part of the proof, we show that (5.3) implies conditions 1 and 2.
The expression in (5.3) can be written as

$$\mathbf{s} \cdot \mathbf{w}(e) \ge \frac{-(\mathbf{a} \cdot \mathbf{w}(e))}{H_{max}(\mathbf{a} \cdot \mathbf{a})}.$$

Using (5.2), this can be written as $\mathbf{s} \cdot \mathbf{w}(e) > -1$. Since
$\mathbf{s} \cdot \mathbf{w}(e)$ is an integer, this implies
$\mathbf{s} \cdot \mathbf{w}(e) \ge 0$. When $\mathbf{s} \cdot \mathbf{w}(e) = 0$, the
expression in (5.3) simplifies to $\mathbf{a} \cdot \mathbf{w}(e) \ge 0$.

In the second part of the proof, we show that conditions 1 and 2 imply (5.3). If
$\mathbf{s} \cdot \mathbf{w}(e) \ge 1$, then (5.3) holds because (5.2) states that
$\mathbf{a} \cdot \mathbf{w}(e) > -H_{max}(\mathbf{a} \cdot \mathbf{a})$. If
$\mathbf{s} \cdot \mathbf{w}(e) = 0$, then (5.3) holds because
$\mathbf{a} \cdot \mathbf{w}(e) \ge 0$. □
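
The equivalence in Theorem 5.1 can also be checked by brute force over a window of integer dependence vectors. The sketch below is only a numerical sanity check, not a proof. The function names, the window size `span`, and the value of $H_{max}$ used in the test are assumptions; vectors that violate (5.2) are skipped because the theorem does not cover them.

```python
from itertools import product


def dot(x, y):
    return x[0] * y[0] + x[1] * y[1]


def lhs_53(w, s, a, h_max):
    """Left-hand side of (5.3): H_max (a.a)(s.w) + a.w."""
    return h_max * dot(a, a) * dot(s, w) + dot(a, w)


def satisfies_both_constraints(w, s, a):
    """Conditions 1 and 2 of Theorem 5.1 (the two causality constraints)."""
    sw = dot(s, w)
    return sw > 0 or (sw == 0 and dot(a, w) >= 0)


def theorem_holds(s, a, h_max, span=6):
    """Compare (5.3) with conditions 1 and 2 for every w in [-span, span]^2 obeying (5.2)."""
    for w in product(range(-span, span + 1), repeat=2):
        if h_max * dot(a, a) <= abs(dot(a, w)):
            continue  # (5.2) does not hold for this w
        if (lhs_53(w, s, a, h_max) >= 0) != satisfies_both_constraints(w, s, a):
            return False
    return True


if __name__ == "__main__":
    print(theorem_holds(s=(1, 1), a=(-1, 1), h_max=4))
    print(theorem_holds(s=(1, 0), a=(0, 1), h_max=5))
```

The check relies on $\mathbf{s} \cdot \mathbf{w}(e)$ being an integer, which is the same fact the proof uses to pass from $\mathbf{s} \cdot \mathbf{w}(e) > -1$ to $\mathbf{s} \cdot \mathbf{w}(e) \ge 0$.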

If we let

$$F(\mathbf{x}) = H_{max}(\mathbf{a} \cdot \mathbf{a})(\mathbf{s} \cdot \mathbf{x}) + \mathbf{a} \cdot \mathbf{x},$$

then causality can be written as $F(\mathbf{w}(e)) \ge 0$ for all $e \in E$. This
definition of $F(\mathbf{x})$ is used throughout the remainder of the chapter. For a
retimed 2DFG $G_r$, causality can be written as $F(\mathbf{w}_r(e)) \ge 0$ for all
$e \in E$. A 2-D retiming $\mathbf{r}$ from $G$ to $G_r$ is legal if
$F(\mathbf{w}_r(e)) \ge 0$ for all $e \in E$.

5.3.2 The Clock Period Constraints

In this section we develop the constraints which can be used to specify a desired clock
period for the retimed 2DFG. Let
$p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$ be a path
in the 2DFG. The delay of the path is $d(p) = \sum_{i=0}^{k} d(v_i)$ and the dependency of
the path is $\mathbf{w}(p) = \sum_{i=0}^{k-1} \mathbf{w}(e_i)$. The clock period $\Phi(G)$ is
defined to be the maximum propagation delay through which any signal must ripple between
clock cycles. Mathematically,

$$\Phi(G) = \max\{d(p) : \mathbf{w}(p) = \mathbf{0}\}.$$

The derivations in this section follow the derivations in [27].

Let

$$W(u,v) = \min\{F(\mathbf{w}(p)) : u \xrightarrow{p} v\}$$

and

$$D(u,v) = \max\{d(p) : u \xrightarrow{p} v \text{ and } F(\mathbf{w}(p)) = W(u,v)\}.$$
Lemma 5.2 Let $G$ be a 2DFG, and let $c$ be any positive real number. The following are
equivalent.

5.2.1 $\Phi(G) \le c$.

5.2.2 For all vertices $u$ and $v$ in $V$, if $D(u,v) > c$, then
$W(u,v) \ge \mathbf{a} \cdot \mathbf{a}$.

Proof: (5.2.1 $\Rightarrow$ 5.2.2): Suppose $\Phi(G) \le c$ and let $u$ and $v$ be vertices
such that $D(u,v) > c$. Assume that $W(u,v) < \mathbf{a} \cdot \mathbf{a}$. If all edges in
$G$ are causal, then $W(u,v) = 0$, so there exists a path $u \xrightarrow{p} v$ with
propagation delay $d(p) = D(u,v) > c$ and $F(\mathbf{w}(p)) = W(u,v) = 0$, which implies
$\mathbf{w}(p) = \mathbf{0}$ and $\Phi(G) > c$. Contradiction.

(5.2.2 $\Rightarrow$ 5.2.1): Suppose 5.2.2 holds and let $u \xrightarrow{p} v$ be any path
in $G$ such that $F(\mathbf{w}(p)) = 0$. Then we have $W(u,v) = F(\mathbf{w}(p)) = 0$, which
implies $d(p) \le D(u,v) \le c$ (this is the contrapositive of "if $D(u,v) > c$ then
$W(u,v) \ge \mathbf{a} \cdot \mathbf{a}$"). This implies 5.2.1. □

A critical path is any path $u \xrightarrow{p} v$ with $F(\mathbf{w}(p)) = W(u,v)$. Assume
that $\mathbf{r}$ is a 2-D retiming that satisfies the causality constraints for a given
processing order specified by $\mathbf{s}$ and $\mathbf{a}$. Let $W_r(u,v)$ and $D_r(u,v)$
have the same definitions on the retimed graph $G_r$ as $W(u,v)$ and $D(u,v)$ have on $G$.
The following can be proven using techniques similar to those used for the 1-D case [27].

• $W_r(u,v) = W(u,v) + F(\mathbf{r}(v) - \mathbf{r}(u))$.

• A path $p$ is a critical path of $G_r$ if and only if it is a critical path of $G$.

• $D_r(u,v) = D(u,v)$ for all connected $u, v \in V$.

• The clock period $\Phi(G_r)$ is equal to $D(u,v)$ for some $u, v \in V$.

Using  these  results,  we  can  prove  the  following. 

Theorem 5.3 Let $c$ be an arbitrary real number and let $\mathbf{s}$ and $\mathbf{a}$ be
orthogonal vectors which specify a linear processing order. Then $\mathbf{r}$ is a legal
retiming such that $\Phi(G_r) \le c$ if and only if

5.3.1 $F(\mathbf{r}(u) - \mathbf{r}(v)) \le F(\mathbf{w}(e))$ for every edge
$u \xrightarrow{e} v$ of $G$, and

5.3.2 $F(\mathbf{r}(u) - \mathbf{r}(v)) \le W(u,v) - \mathbf{a} \cdot \mathbf{a}$ for all
vertices $u, v \in V$ such that $D(u,v) > c$.

Proof: The retiming is legal if and only if 5.3.1 holds. If $\mathbf{r}$ is indeed a legal
retiming of $G$, then by Lemma 5.2 the retimed circuit $G_r$ has clock period
$\Phi(G_r) \le c$ under the condition that $W_r(u,v) \ge \mathbf{a} \cdot \mathbf{a}$ for
all vertices $u, v \in V$ such that $D_r(u,v) > c$. Since we know $D_r(u,v) = D(u,v)$ and
$W_r(u,v) = W(u,v) + F(\mathbf{r}(v) - \mathbf{r}(u))$, $G_r$ has $\Phi(G_r) \le c$ under
the condition that $W(u,v) \ge -F(\mathbf{r}(v) - \mathbf{r}(u)) + \mathbf{a} \cdot \mathbf{a}$
for all $u, v \in V$ such that $D(u,v) > c$. Since
$F(\mathbf{r}(v) - \mathbf{r}(u)) = -F(\mathbf{r}(u) - \mathbf{r}(v))$, this is equivalent to
5.3.2. □
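
The quantities $W(u,v)$ and $D(u,v)$ and the two conditions of Theorem 5.3 can be evaluated on a small graph. The sketch below computes $W$ and $D$ with an all-pairs shortest-path pass over lexicographically ordered pairs, in the spirit of the 1-D derivations in [27]. The graph is an assumed reading of Figure 5.2(a), which is not reproduced here: nodes 1 and 2 are adders, nodes 3 and 4 are multipliers, and the only nonzero dependencies are $[1\ 0]^T$ on edge 1 -> 3 and $[0\ 1]^T$ on edge 1 -> 4. The value $H_{max} = 3$ and all function names are also assumptions. The sketch further assumes that every cycle has a positive value of $F$.

```python
INF = float("inf")


def dot(x, y):
    return x[0] * y[0] + x[1] * y[1]


def make_F(s, a, h_max):
    """F(x) = H_max (a.a)(s.x) + a.x for a fixed processing order."""
    return lambda x: h_max * dot(a, a) * dot(s, x) + dot(a, x)


def compute_W_D(nodes, d, edges, F):
    """All-pairs W(u,v) and D(u,v) by Floyd-Warshall on pairs (F of path, -delay so far)."""
    dist = {u: {v: (INF, 0) for v in nodes} for u in nodes}
    for u in nodes:
        dist[u][u] = (0, 0)                      # the empty path from u to u
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], (F(w), -d[u]))
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k][0] == INF or dist[k][j][0] == INF:
                    continue
                cand = (dist[i][k][0] + dist[k][j][0], dist[i][k][1] + dist[k][j][1])
                if cand < dist[i][j]:            # smaller F first, then larger delay
                    dist[i][j] = cand
    W = {(u, v): dist[u][v][0] for u in nodes for v in nodes if dist[u][v][0] != INF}
    D = {(u, v): d[v] - dist[u][v][1] for (u, v) in W}
    return W, D


def is_feasible(r, c, edges, W, D, F, a):
    """Check conditions 5.3.1 and 5.3.2 of Theorem 5.3 for retiming r and clock period c."""
    def diff(u, v):
        return (r[u][0] - r[v][0], r[u][1] - r[v][1])
    legal = all(F(diff(u, v)) <= F(w) for u, v, w in edges)
    fast = all(F(diff(u, v)) <= W[(u, v)] - dot(a, a) for (u, v) in W if D[(u, v)] > c)
    return legal and fast


# Assumed reading of Figure 5.2(a); computation times follow the text (add 1, multiply 2).
NODES = [1, 2, 3, 4]
D_NODE = {1: 1, 2: 1, 3: 2, 4: 2}
EDGES = [(1, 3, (1, 0)), (1, 4, (0, 1)), (3, 2, (0, 0)), (4, 2, (0, 0)), (2, 1, (0, 0))]
S, A, H_MAX = (1, 1), (-1, 1), 3

F = make_F(S, A, H_MAX)
W, D = compute_W_D(NODES, D_NODE, EDGES, F)
CLOCK = max(D[p] for p in W if W[p] == 0)        # longest path with zero dependence

R_TEXT = {1: (0, 0), 2: (0, 0), 3: (-2, 1), 4: (-1, 0)}   # retiming vectors quoted in Section 5.2.2
R_ZERO = {n: (0, 0) for n in NODES}

if __name__ == "__main__":
    print("clock period of the unretimed graph:", CLOCK)
    print("retiming from the text feasible for c = 2:", is_feasible(R_TEXT, 2, EDGES, W, D, F, A))
    print("zero retiming feasible for c = 2:", is_feasible(R_ZERO, 2, EDGES, W, D, F, A))
```

Under these assumptions, the unretimed graph has a clock period of 4, the retiming vectors quoted in Section 5.2.2 satisfy both conditions for $c = 2$, and the zero retiming does not, which agrees with the clock periods stated there.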

5.3.3  The  Memory  Cost 

For the ILP formulation to be complete, it requires a linear approximation of the number of registers required to implement the retimed circuit. A linear approximation for the number of registers required to implement the dependency $\mathbf{w}(e)$ should consider the number of access lines and scanning lines crossed by the dependency. The number of access lines crossed is $\mathbf{s} \cdot \mathbf{w}(e)$, and the maximum number of samples in an access line is $H_{max}$, so an upper bound on the number of registers required to store $\mathbf{s} \cdot \mathbf{w}(e)$ access lines is $H_{max}(\mathbf{s} \cdot \mathbf{w}(e))$. The number of scanning lines crossed by $\mathbf{w}(e)$ is $\mathbf{a} \cdot \mathbf{w}(e)$, and one register is required for every $\mathbf{a} \cdot \mathbf{a}$ scanning lines that are crossed (to see this, consider that the dependency corresponding to a single sample delay is $\mathbf{w}(e) = \mathbf{a}$); so an estimate for the number of registers required due to scanning lines that are crossed is $(\mathbf{a} \cdot \mathbf{w}(e))/(\mathbf{a} \cdot \mathbf{a})$. The linear approximation for the total number of registers required to implement the dependency $\mathbf{w}(e)$ is $H_{max}(\mathbf{s} \cdot \mathbf{w}(e)) + (\mathbf{a} \cdot \mathbf{w}(e))/(\mathbf{a} \cdot \mathbf{a})$, which can be written as $F(\mathbf{w}(e))/(\mathbf{a} \cdot \mathbf{a})$.
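The register estimate above can be sketched in a few lines of Python. This is only an illustration: the function names `F` and `register_estimate` and the tuple representation of vectors are assumptions made here, and $F(\mathbf{x}) = H_{max}(\mathbf{a} \cdot \mathbf{a})(\mathbf{s} \cdot \mathbf{x}) + \mathbf{a} \cdot \mathbf{x}$ is taken from the relation stated in the text.

```python
def dot(x, y):
    """Dot product of two equal-length integer vectors."""
    return sum(p * q for p, q in zip(x, y))


def F(x, s, a, h_max):
    """F(x) = H_max * (a.a) * (s.x) + a.x, as implied by the text."""
    return h_max * dot(a, a) * dot(s, x) + dot(a, x)


def register_estimate(w, s, a, h_max):
    """Linear register estimate F(w) / (a.a) for a dependency w."""
    return F(w, s, a, h_max) / dot(a, a)


if __name__ == "__main__":
    s, a, h_max = (1, 2), (-2, 1), 4
    # A single sample delay corresponds to w = a and costs one register.
    print(register_estimate(a, s, a, h_max))
```

With the processing order $\mathbf{s} = [1\ 2]^T$, $\mathbf{a} = [-2\ 1]^T$ and $H_{max} = 4$, a dependency equal to $\mathbf{a}$ evaluates to one register, which agrees with the single-sample-delay argument in the paragraph.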

If  a  node  has  more  than  one  output  edge  carrying  the  same  signal  (such  a  node  is  often 
called  a  fanout  node),  the  number  of  registers  required  to  implement  these  edges  is  the 
maximum  number  of  registers  on  any  one  of  them  [21].  This  is  shown  in  Figure  5.5  for  the 
1-D  case,  where  the  naive  implementation  in  Figure  5.5(a)  uses  1  +  3  +  7  =  11  registers 
while the efficient implementation in Figure 5.5(b) uses max(1,3,7) = 7 registers. Using this concept, the number of registers required to implement the output edges of node $v$ is estimated to be

$$R_v = \max_{v \xrightarrow{e} ?} \{ F(\mathbf{w}_r(e)) / (\mathbf{a} \cdot \mathbf{a}) \}.$$

The cost function can be minimized by using COST $= \sum_{v \in V} R_v$ where $R_v \ge F(\mathbf{w}_r(e))$ for all edges $v \xrightarrow{e} ?$. Note that this cost represents the number of memory locations scaled by the constant factor $(\mathbf{a} \cdot \mathbf{a})$.


Figure 5.5: (a) Fanout implementation using 1 + 3 + 7 = 11 registers. (b) Fanout implementation using max(1,3,7) = 7 registers.

5.3.4  The  Complete  ILP  2-D  Retiming  Formulation

Theorem 5.3 specifies the conditions for a retiming to be legal and satisfy a given clock period constraint. Combining this with the cost function, the complete ILP formulation of 2-D retiming is: Minimize COST $= \sum_{v \in V} R_v$ under the constraints

1. $R_v \ge F(\mathbf{w}_r(e))$ for all edges $v \xrightarrow{e} ?$ and all $v \in V$ (fanout constraint).

2. $F(\mathbf{r}(u) - \mathbf{r}(v)) \le F(\mathbf{w}(e))$ for every edge $u \xrightarrow{e} v$ of $G$ (causality constraint).

3. $F(\mathbf{r}(u) - \mathbf{r}(v)) \le W(u,v) - \mathbf{a} \cdot \mathbf{a}$ for all vertices $u,v \in V$ such that $D(u,v) > c$ (clock period constraint).

Example 5.2 Consider the 2DFG in Figure 5.6(a). Assume that the computation time for each node is 1 time unit. The goal is to retime this 2DFG to minimize the memory while achieving a clock period of $\Phi(G_r) = 1$ assuming an 8 × 8 data set and a processing order specified by $\mathbf{s} = [1\ 2]^T$ and $\mathbf{a} = [-2\ 1]^T$. The maximum number of samples on an access line is $H_{max} = 4$ and $\mathbf{a} \cdot \mathbf{a} = 5$, so $F(\mathbf{x}) = 20(\mathbf{s} \cdot \mathbf{x}) + \mathbf{a} \cdot \mathbf{x}$. The ILP formulation is to minimize COST $= R_1 + R_2 + R_3 + R_4$ subject to the fanout constraints, the causality constraints, and the clock period constraints. The five fanout constraints are


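$R_1 \ge 0 + F(\mathbf{r}(2) - \mathbf{r}(1))$
$R_1 \ge 0 + F(\mathbf{r}(3) - \mathbf{r}(1))$
$R_2 \ge 23 + F(\mathbf{r}(4) - \mathbf{r}(2))$
$R_3 \ge 59 + F(\mathbf{r}(4) - \mathbf{r}(3))$
$R_4 \ge 0 + F(\mathbf{r}(1) - \mathbf{r}(4))$.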

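The five causality constraints are

$F(\mathbf{r}(1) - \mathbf{r}(2)) \le 0$
$F(\mathbf{r}(1) - \mathbf{r}(3)) \le 0$
$F(\mathbf{r}(2) - \mathbf{r}(4)) \le 23$
$F(\mathbf{r}(3) - \mathbf{r}(4)) \le 59$
$F(\mathbf{r}(4) - \mathbf{r}(1)) \le 0$.

The values of $W(u,v)$ and $D(u,v)$ are given in Table 5.3, and based on these values the twelve clock period constraints are

$F(\mathbf{r}(1) - \mathbf{r}(2)) \le -5$
$F(\mathbf{r}(1) - \mathbf{r}(3)) \le -5$
$F(\mathbf{r}(1) - \mathbf{r}(4)) \le 18$
$F(\mathbf{r}(2) - \mathbf{r}(1)) \le 18$
$F(\mathbf{r}(2) - \mathbf{r}(3)) \le 18$
$F(\mathbf{r}(2) - \mathbf{r}(4)) \le 18$
$F(\mathbf{r}(3) - \mathbf{r}(1)) \le 54$
$F(\mathbf{r}(3) - \mathbf{r}(2)) \le 54$
$F(\mathbf{r}(3) - \mathbf{r}(4)) \le 54$
$F(\mathbf{r}(4) - \mathbf{r}(1)) \le -5$
$F(\mathbf{r}(4) - \mathbf{r}(2)) \le -5$
$F(\mathbf{r}(4) - \mathbf{r}(3)) \le -5$.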

The retiming solution, found using the ILP solver GAMS [63], is $\mathbf{r}(1) = [0\ 1]^T$, $\mathbf{r}(2) = [3\ 0]^T$, $\mathbf{r}(3) = [3\ 0]^T$, and $\mathbf{r}(4) = [2\ 0]^T$. The values of $R_1$, $R_2$, $R_3$, and $R_4$ are 13, 5, 41, and 5, respectively, and the total cost is COST = 64. The retimed 2DFG is shown in Figure 5.6(b).
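The reported solution of Example 5.2 can be checked against the constraints with a short Python sketch. The dictionaries below simply transcribe the constants from the example (edge values $F(\mathbf{w}(e))$ and the clock period bounds); the names `EDGE_F`, `CLOCK_BOUND`, and `check_solution` are hypothetical and chosen only for this illustration. The sketch verifies a given retiming; it is not an ILP solver.

```python
S, A, H_MAX = (1, 2), (-2, 1), 4


def dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def F(x):
    """F(x) = H_max (a.a)(s.x) + a.x = 20 (s.x) + a.x for this example."""
    return H_MAX * dot(A, A) * dot(S, x) + dot(A, x)


def diff(x, y):
    return tuple(p - q for p, q in zip(x, y))


# F(w(e)) for each edge u -> v, read from the causality constraints.
EDGE_F = {(1, 2): 0, (1, 3): 0, (2, 4): 23, (3, 4): 59, (4, 1): 0}

# Right-hand sides W(u,v) - a.a of the twelve clock period constraints.
CLOCK_BOUND = {
    (1, 2): -5, (1, 3): -5, (1, 4): 18,
    (2, 1): 18, (2, 3): 18, (2, 4): 18,
    (3, 1): 54, (3, 2): 54, (3, 4): 54,
    (4, 1): -5, (4, 2): -5, (4, 3): -5,
}


def check_solution(r):
    """Return (feasible, R, cost) for a retiming r: node -> vector."""
    causal = all(F(diff(r[u], r[v])) <= fw for (u, v), fw in EDGE_F.items())
    clock = all(F(diff(r[u], r[v])) <= b for (u, v), b in CLOCK_BOUND.items())
    R = {}
    for (u, v), fw in EDGE_F.items():
        need = fw + F(diff(r[v], r[u]))  # F(w_r(e)) for edge u -> v
        R[u] = max(R.get(u, need), need)  # fanout: largest output edge
    return causal and clock, R, sum(R.values())


if __name__ == "__main__":
    r = {1: (0, 1), 2: (3, 0), 3: (3, 0), 4: (2, 0)}
    print(check_solution(r))
```

Taking each $R_v$ as the smallest value allowed by the fanout constraints reproduces the values 13, 5, 41, and 5 and the total cost of 64 stated in the example.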


A drawback of ILP 2-D retiming is its slow convergence. In our experience, the ILP solver can take several minutes to find an optimal solution for 2DFGs with as few as 12 nodes. The linear programming formulation in the next section can be solved in polynomial time, resulting in significantly faster solution times than ILP 2-D retiming.

5.4  Orthogonal  2-D  Retiming 

Orthogonal two-dimensional retiming partitions the 2-D retiming problem into two 1-D retiming problems. These 1-D retiming problems, which we call s-retiming and a-retiming, can be solved in polynomial time using techniques similar to those introduced in [27]. By partitioning the 2-D retiming problem into two 1-D retiming problems, some quality of the final solution may be sacrificed because the final solution is no longer guaranteed to be globally optimal; however, our experience has shown that orthogonal 2-D retiming finds solutions that are comparable to the ILP solutions, and these solutions are found in much less CPU time than the ILP solutions.


Simply  stated,  orthogonal  2-D  retiming  is  performed  by  first  performing  s-retiming 
and  then  performing  a-retiming,  where  these  two  tasks  are  specified  below: 

•  s-retiming: Project the 2-D retiming problem onto the s-vector and solve this 1-D retiming problem to find the values of $\mathbf{s} \cdot \mathbf{w}_r(e)$ for $e \in E$.

•  a-retiming: Project the 2-D retiming problem onto the a-vector and solve this 1-D retiming problem to find the values of $\mathbf{a} \cdot \mathbf{w}_r(e)$ for $e \in E$.

The following subsections describe s-retiming and a-retiming along with the fanout model used in orthogonal 2-D retiming. Throughout these subsections, the notations $x^{(s)}$ and $x^{(a)}$ are used to denote $\mathbf{x} \cdot \mathbf{s}$ and $\mathbf{x} \cdot \mathbf{a}$, respectively.

5.4.1  Fanout  Model 


In  the  ILP  formulation  of  2-D  retiming  presented  in  Section  5.3,  the  fanout  constraint  is 
used  to  ensure  that  the  memory  required  by  the  output  edges  of  a  node  is  the  maximum 
memory  required  by  any  of  the  output  edges  of  the  node.  In  1-D  retiming  [27],  a  “gadget” 
is  used  to  model  the  fanout  node  so  the  memory  required  by  the  output  edges  of  the 
node  can  be  accurately  modeled  using  a  linear  programming  formulation.  Figure  5.7 
shows  a  similar  gadget  used  so  that  the  2-D  retiming  problem  can  be  modeled  as  two
linear  programming  problems. 


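The following four quantities are used in orthogonal 2-D retiming:

$$w^{(s)}_{max} = \max_{1 \le i \le k} \{ w^{(s)}(e_i) \}$$

$$w^{(s)}_{r,max} = \max_{1 \le i \le k} \{ w^{(s)}_r(e_i) \}$$

$$w^{(a)}_{max} = \max_{e_i : w^{(s)}_r(e_i) = w^{(s)}_{r,max}} \{ w^{(a)}(e_i) \}$$

$$w^{(a)}_{r,max} = \max_{e_i : w^{(s)}_r(e_i) = w^{(s)}_{r,max}} \{ w^{(a)}_r(e_i) \}$$

Note that $w^{(s)}_{max}$ is known from the unretimed 2DFG, $w^{(s)}_{r,max}$ and $w^{(a)}_{max}$ are known after s-retiming has been performed, and $w^{(a)}_{r,max}$ is known after s-retiming and a-retiming have been performed.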


Figure 5.7(a) shows a fanout node $u$ with $k$ output edges. The gadget in Figure 5.7(b) is used to model the fanout node in a 2DFG. Each of the $k$ edges $e_i$, $1 \le i \le k$, has an associated weight $\mathbf{w}(e_i)$ which is known from the 2DFG. The node $\hat{u}$ is a dummy node with zero computation time ($d(\hat{u}) = 0$), and the edges $\hat{e}_i$, $1 \le i \le k$, are dummy edges used so the linear programming formulations used in orthogonal 2-D retiming can accurately model the memory required by a node with more than one output edge. We call the edges $\hat{e}_i$, $1 \le i \le k$, auxiliary edges.


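In addition to the weights $\mathbf{w}(e_i)$, each of the edges $e_i$ has the associated quantities $\sigma(e_i) = 1/k$ and

$$\gamma(e_i) = \begin{cases} 1/m & \text{if } w^{(s)}_r(e_i) = w^{(s)}_{r,max} \\ 0 & \text{otherwise} \end{cases}$$

where $m$ is the number of edges $e_i$ satisfying $w^{(s)}_r(e_i) = w^{(s)}_{r,max}$ after s-retiming has been performed. Each auxiliary edge in Figure 5.7(b) has the associated quantities

$$w^{(s)}(\hat{e}_i) = w^{(s)}_{max} - w^{(s)}(e_i)$$

$$w^{(a)}(\hat{e}_i) = w^{(a)}_{max} - w^{(a)}(e_i)$$

and $\sigma(\hat{e}_i) = 1/k$ and

$$\gamma(\hat{e}_i) = \begin{cases} 1/m & \text{if } w^{(s)}_r(\hat{e}_i) = 0 \\ 0 & \text{otherwise} \end{cases}$$

where $m$ has the same definition as it has in $\gamma(e_i)$.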


Figure  5.7:  (a)  A  fanout  node  u.  (b)  A  gadget  used  to  model  node  u  in  the  linear 
programming  formulations  of  orthogonal  2-D  retiming. 

5.4.2  s-Retiming

In orthogonal 2-D retiming, s-retiming affects the memory requirements of the retimed 2DFG more than a-retiming because s-retiming deals with entire delay lines while a-retiming deals with single delays. As a result, s-retiming is performed first on the 2DFG, and then a-retiming is performed.

In s-retiming the 2-D retiming problem is projected onto the scanning vector. Starting with the 2-D retiming equation in (5.1), we can take the dot product of both sides of the equation with the scanning vector $\mathbf{s}$ to get

$$\mathbf{s} \cdot \mathbf{w}_r(e) = \mathbf{s} \cdot \mathbf{w}(e) + \mathbf{s} \cdot \mathbf{r}(v) - \mathbf{s} \cdot \mathbf{r}(u). \quad (5.4)$$

Using the notation $x^{(s)}$ to denote $\mathbf{s} \cdot \mathbf{x}$, (5.4) can be written as

$$w^{(s)}_r(e) = w^{(s)}(e) + r^{(s)}(v) - r^{(s)}(u). \quad (5.5)$$

The first causality constraint in Section 5.3.1 requires $\mathbf{s} \cdot \mathbf{w}_r(e) \ge 0$ for all $e \in E$, which can be rewritten as $w^{(s)}_r(e) \ge 0$ for all $e \in E$. Using this and (5.5) results in

$$w^{(s)}(e) + r^{(s)}(v) - r^{(s)}(u) \ge 0 \quad (5.6)$$

for all $e \in E$. The second causality constraint in Section 5.3.1 and the clock period constraint in Section 5.3.2 are enforced during a-retiming.

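The cost function for s-retiming is the total number of access lines crossed by the dependencies. This can be written as

$$\mathrm{COST} = \sum_{e \in E} \sigma(e) w^{(s)}_r(e) = \sum_{e \in E} \sigma(e) w^{(s)}(e) + \sum_{e \in E} \sigma(e) \left( r^{(s)}(v) - r^{(s)}(u) \right), \quad (5.7)$$

where $\sigma(e)$ is the weight of an edge according to the fanout model in Section 5.4.1. The formulation of s-retiming consists of minimizing the total number of access lines crossed (i.e., minimize COST in (5.7)) while keeping $w^{(s)}_r(e) \ge 0$ for all $e \in E$ using (5.6). Since $\sum_{e \in E} \sigma(e) w^{(s)}(e)$ is fixed, s-retiming can be stated as: Minimize

$$\mathrm{COST}' = \sum_{v \in V} r^{(s)}(v) \left( \sum_{? \xrightarrow{e} v} \sigma(e) - \sum_{v \xrightarrow{e} ?} \sigma(e) \right)$$

subject to $r^{(s)}(u) - r^{(s)}(v) \le w^{(s)}(e)$ for all $e \in E$.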

Example 5.3 In this example, we perform s-retiming on the 2DFG in Figure 5.6(a) assuming $\mathbf{s} = [1\ 2]^T$ and $\mathbf{a} = [-2\ 1]^T$. Using the fanout model described in Section 5.4.1, the 2DFG in Figure 5.6(a) is redrawn in Figure 5.8(a), where node 5 is the dummy node associated with fanout node 1. The cost function is

$$\mathrm{COST}' = r^{(s)}(1)(1 - 1) + r^{(s)}(2)\left(\tfrac{1}{2} - \tfrac{3}{2}\right) + r^{(s)}(3)\left(\tfrac{1}{2} - \tfrac{3}{2}\right) + r^{(s)}(4)(2 - 1) + r^{(s)}(5)(1 - 0)$$
$$= -r^{(s)}(2) - r^{(s)}(3) + r^{(s)}(4) + r^{(s)}(5).$$

The s-retiming problem is to minimize COST' subject to the following seven causality constraints

$r^{(s)}(1) - r^{(s)}(2) \le 0$
$r^{(s)}(1) - r^{(s)}(3) \le 0$
$r^{(s)}(2) - r^{(s)}(4) \le 1$
$r^{(s)}(2) - r^{(s)}(5) \le 0$
$r^{(s)}(3) - r^{(s)}(4) \le 3$
$r^{(s)}(3) - r^{(s)}(5) \le 0$
$r^{(s)}(4) - r^{(s)}(1) \le 0$,

and the solution found using the linear programming solver in GAMS [63] is $r^{(s)}(1) = 1$, $r^{(s)}(2) = 1$, $r^{(s)}(3) = 3$, $r^{(s)}(4) = 0$, and $r^{(s)}(5) = 3$. The result of s-retiming is shown in Figure 5.8(b), where the numbers in parentheses represent $w^{(s)}_r(e)$. This solution is combined with the results of a-retiming in Section 5.4.3 to obtain the complete orthogonal 2-D retiming solution.
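As a sanity check on Example 5.3, the small s-retiming problem can be explored by exhaustive search in Python. This sketch is not the linear programming method used in the thesis; it enumerates integer assignments in an assumed window of 0 to 4 for each node, which is large enough for this example because the cost is unchanged when all values are shifted by the same constant. The names `CONSTRAINTS`, `COEFF`, and `brute_force` are hypothetical.

```python
from itertools import product

# Difference constraints r(u) - r(v) <= bound from Example 5.3.
CONSTRAINTS = {
    (1, 2): 0, (1, 3): 0, (2, 4): 1, (2, 5): 0,
    (3, 4): 3, (3, 5): 0, (4, 1): 0,
}

# COST' = -r(2) - r(3) + r(4) + r(5); node 1 has coefficient zero.
COEFF = {1: 0, 2: -1, 3: -1, 4: 1, 5: 1}


def feasible(r):
    return all(r[u] - r[v] <= b for (u, v), b in CONSTRAINTS.items())


def cost(r):
    return sum(COEFF[v] * r[v] for v in COEFF)


def brute_force(lo=0, hi=4):
    """Return (best_cost, one_best_assignment) over the search window."""
    best = None
    for values in product(range(lo, hi + 1), repeat=5):
        r = dict(zip(range(1, 6), values))
        if feasible(r) and (best is None or cost(r) < best[0]):
            best = (cost(r), r)
    return best


if __name__ == "__main__":
    thesis = {1: 1, 2: 1, 3: 3, 4: 0, 5: 3}
    print(feasible(thesis), cost(thesis), brute_force()[0])
```

The solution reported in the example is feasible and attains the same cost as the best assignment found by the search. Other assignments can reach the same cost, so the search is not expected to return the identical values.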


Figure 5.8: (a) The unretimed graph using the fanout model. (b) The result of s-retiming, where the numbers in parentheses represent $w^{(s)}_r(e)$.

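The s-retiming formulation accurately models the memory requirements of a fanout node. The following explanation uses the notation introduced in Section 5.4.1. Let the path $u \to v_i \to \hat{u}$ in Figure 5.7 be denoted as $p_i$. The values of $w^{(s)}_r(\hat{e}_i)$ are made as small as possible under the constraint $w^{(s)}_r(\hat{e}_i) \ge 0$. Therefore, the value of $r^{(s)}(\hat{u})$ will force $w^{(s)}_r(\hat{e}_i) = 0$ for at least one edge which we call $\hat{e}_j$ (i.e., $w^{(s)}_r(\hat{e}_j) = 0$). Since $\min_{1 \le i \le k} \{ w^{(s)}_r(\hat{e}_i) \} = w^{(s)}_r(\hat{e}_j)$ and the retimed path weights $w^{(s)}_r(p_i)$ are identical for $1 \le i \le k$ (they are all equal to $w^{(s)}_{max} + r^{(s)}(\hat{u}) - r^{(s)}(u)$) because the unretimed path weights are identical (they are all equal to $w^{(s)}_{max}$), we know $w^{(s)}_r(e_j) = w^{(s)}_{r,max}$. This means that

$$w^{(s)}_r(p_i) = w^{(s)}_r(e_j) + w^{(s)}_r(\hat{e}_j) = w^{(s)}_{r,max}.$$

The total cost of the $k$ fanout edges is

$$\sum_{e \in \{e_i, \hat{e}_i\}, 1 \le i \le k} \sigma(e) w^{(s)}_r(e) = \sum_{e \in \{e_i, \hat{e}_i\}, 1 \le i \le k} \sigma(e) w^{(s)}(e) + \sum_{e \in \{e_i, \hat{e}_i\}, 1 \le i \le k} \sigma(e) \left( r^{(s)}(v) - r^{(s)}(u) \right)$$
$$= \frac{1}{k} \sum_{1 \le i \le k} w^{(s)}_{max} + \frac{1}{k} \sum_{1 \le i \le k} r^{(s)}(\hat{u}) - \frac{1}{k} \sum_{1 \le i \le k} r^{(s)}(u)$$
$$= \frac{1}{k} \sum_{1 \le i \le k} w^{(s)}_{r,max}$$
$$= w^{(s)}_{r,max},$$

as desired.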

5.4.3  a-Retiming 


In a-retiming the 2-D retiming problem is projected onto the access vector. While s-retiming takes the first causality constraint of Section 5.3.1 into account, a-retiming takes the second causality constraint and the clock period constraint into account. Like s-retiming, a-retiming is a linear programming formulation which can be solved in polynomial time.


The constraints for a-retiming are the second causality constraint in Section 5.3.1 and the clock period constraint. Starting with (5.1), we can take the dot product of both sides of the equation with the access vector $\mathbf{a}$ to get

$$\mathbf{a} \cdot \mathbf{w}_r(e) = \mathbf{a} \cdot \mathbf{w}(e) + \mathbf{a} \cdot \mathbf{r}(v) - \mathbf{a} \cdot \mathbf{r}(u). \quad (5.8)$$

Using the notation $x^{(a)}$ to denote $\mathbf{a} \cdot \mathbf{x}$, (5.8) can be written as

$$w^{(a)}_r(e) = w^{(a)}(e) + r^{(a)}(v) - r^{(a)}(u). \quad (5.9)$$

The second causality constraint in Section 5.3.1 requires $w^{(a)}_r(e) \ge 0$ for all $e \in E$ such that $w^{(s)}_r(e) = 0$. Using this in (5.9) results in

$$w^{(a)}(e) + r^{(a)}(v) - r^{(a)}(u) \ge 0 \quad (5.10)$$

for all $e \in E$ such that $w^{(s)}_r(e) = 0$.

Clock period constraints must also be taken into account during a-retiming. A set of constraints for a-retiming is formulated such that the clock period of the retimed graph satisfies $\Phi(G_r) \le c$ for some desired clock period $c$. The following notations are used:

$W^{(s)}(u,v) = \min\{ w^{(s)}(p) : u \overset{p}{\leadsto} v \}$, $u,v \in V$

$W^{(s)}_r(u,v) = \min\{ w^{(s)}_r(p) : u \overset{p}{\leadsto} v \}$, $u,v \in V$

$W^{(a)}(u,v) = \min\{ w^{(a)}(p) : u \overset{p}{\leadsto} v \text{ and } w^{(s)}(p) = W^{(s)}(u,v) \}$, $u,v \in V$

$W^{(a)}_r(u,v) = \min\{ w^{(a)}_r(p) : u \overset{p}{\leadsto} v \text{ and } w^{(s)}_r(p) = W^{(s)}_r(u,v) \}$, $u,v \in V$

$D(u,v) = \max\{ d(p) : u \overset{p}{\leadsto} v \text{ and } w^{(a)}(p) = W^{(a)}(u,v) \}$, $u,v \in V$

$D_r(u,v) = \max\{ d_r(p) : u \overset{p}{\leadsto} v \text{ and } w^{(a)}_r(p) = W^{(a)}_r(u,v) \}$, $u,v \in V$

The following two lemmas are useful for finding a-retiming conditions which satisfy a given clock period constraint.

Lemma 5.4 Let $\mathbf{r}$ be a legal 2-D retiming which retimes $G$ to $G_r$. The following hold:

5.4.1 $W^{(a)}_r(u,v) = W^{(a)}(u,v) + r^{(a)}(v) - r^{(a)}(u)$.

5.4.2 $D_r(u,v) = D(u,v)$.

Proof:

(5.4.1)

$W^{(a)}_r(u,v) = \min\{ w^{(a)}_r(p) : u \overset{p}{\leadsto} v \text{ and } w^{(s)}_r(p) = W^{(s)}_r(u,v) \}$
$= \min\{ w^{(a)}(p) + r^{(a)}(v) - r^{(a)}(u) : u \overset{p}{\leadsto} v \text{ and } w^{(s)}(p) = W^{(s)}(u,v) \}$
$= r^{(a)}(v) - r^{(a)}(u) + \min\{ w^{(a)}(p) : u \overset{p}{\leadsto} v \text{ and } w^{(s)}(p) = W^{(s)}(u,v) \}$
$= r^{(a)}(v) - r^{(a)}(u) + W^{(a)}(u,v)$

(5.4.2) We can use $d(p) = d_r(p)$ and the result from 5.4.1 to write

$D_r(u,v) = \max\{ d_r(p) : u \overset{p}{\leadsto} v \text{ and } w^{(a)}(p) + r^{(a)}(v) - r^{(a)}(u) = W^{(a)}(u,v) + r^{(a)}(v) - r^{(a)}(u) \}$
$= \max\{ d(p) : u \overset{p}{\leadsto} v \text{ and } w^{(a)}(p) = W^{(a)}(u,v) \}$
$= D(u,v)$. □


Lemma 5.5 For a legal retiming of $G$, the following are equivalent:

5.5.1 $\Phi(G_r) \le c$.

5.5.2 If $D_r(u,v) > c$ and $W^{(s)}_r(u,v) = 0$, then $W^{(a)}_r(u,v) \ge \mathbf{a} \cdot \mathbf{a}$.

The proof of Lemma 5.5 is similar to the proof of Lemma 5.2. Lemmas 5.4 and 5.5 are used to prove the following.


Theorem 5.6 Given an s-retiming solution such that $r^{(s)}(u) - r^{(s)}(v) \le w^{(s)}(e)$ for all edges $u \xrightarrow{e} v$ in $E$, the values $r^{(a)}(u)$ result in a legal 2-D retiming of $G$ such that $\Phi(G_r) \le c$ if and only if

5.6.1 $r^{(a)}(u) - r^{(a)}(v) \le w^{(a)}(e)$ for all $e \in E$ such that $w^{(s)}_r(e) = 0$.

5.6.2 $r^{(a)}(u) - r^{(a)}(v) \le W^{(a)}(u,v) - \mathbf{a} \cdot \mathbf{a}$ for all vertices $u,v \in V$ such that $D(u,v) > c$ and $W^{(s)}_r(u,v) = 0$.

Proof: 5.6.1 is simply the second causality constraint for a legal 2-D retiming. If 5.6.1 holds, then $\mathbf{r}$ is a legal retiming and by Lemma 5.5 the retimed graph $G_r$ has clock period $\Phi(G_r) \le c$ under the condition $W^{(a)}_r(u,v) \ge \mathbf{a} \cdot \mathbf{a}$ for all vertices $u,v \in V$ such that $D_r(u,v) > c$ and $W^{(s)}_r(u,v) = 0$. From Lemma 5.4, we know $D_r(u,v) = D(u,v)$ and $W^{(a)}_r(u,v) = W^{(a)}(u,v) + r^{(a)}(v) - r^{(a)}(u)$. Therefore, Lemma 5.5 states that $\Phi(G_r) \le c$ is equivalent to 5.6.2. □

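The cost of a-retiming is the weighted number of scanning lines crossed, given by

$$\mathrm{COST} = \sum_{e \in E} \gamma(e) w^{(a)}_r(e) = \sum_{e \in E} \gamma(e) w^{(a)}(e) + \sum_{e \in E} \gamma(e) \left( r^{(a)}(v) - r^{(a)}(u) \right).$$

Since $\sum_{e \in E} \gamma(e) w^{(a)}(e)$ is fixed, a-retiming can be stated as follows: Minimize

$$\mathrm{COST}' = \sum_{v \in V} r^{(a)}(v) \left( \sum_{? \xrightarrow{e} v} \gamma(e) - \sum_{v \xrightarrow{e} ?} \gamma(e) \right)$$

subject to

1. $r^{(a)}(u) - r^{(a)}(v) \le w^{(a)}(e)$ for all $e \in E$ such that $w^{(s)}_r(e) = 0$.

2. $r^{(a)}(u) - r^{(a)}(v) \le W^{(a)}(u,v) - \mathbf{a} \cdot \mathbf{a}$ for all $u,v \in V$ such that $D(u,v) > c$ and $W^{(s)}_r(u,v) = 0$.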

Example 5.4 In this example, a-retiming is performed on the 2DFG in Figure 5.6(a). Since a-retiming depends on the results of s-retiming, the results of s-retiming found in Example 5.3 are used in this example. The 2DFG in Figure 5.6(a) is redrawn in Figure 5.9(a), where the values of $w^{(a)}(e)$ and $w^{(s)}_r(e)$ are explicitly shown. We assume that the computation time of each node is 1 time unit, with the exception that the computation time of the dummy node 5 is zero. The goal is to retime the 2DFG so it can be clocked with a clock period of 1 time unit.

Table 5.4: The values of $W^{(s)}_r(u,v)$, $W^{(a)}(u,v)$, and $D(u,v)$ for Example 5.4.

$W^{(s)}_r(u,v)$
 u\v    1    2    3    4    5
  1     0    0    2    0    2
  2     1    0    3    0    2
  3     1    1    0    0    0
  4     1    1    3    0    3
  5     -    -    -    -    0

$W^{(a)}(u,v)$
 u\v    1    2    3    4    5
  1     0    0    0    3    0
  2     3    0    3    3    0
  3    -1   -1    0   -1    0
  4     0    0    0    0    0
  5     -    -    -    -    0

$D(u,v)$
 u\v    1    2    3    4    5
  1     1    2    2    3    2
  2     3    1    4    2    1
  3     3    4    1    2    1
  4     2    3    3    1    3
  5     -    -    -    -    0

Figure 5.9: (a) The 2DFG which is subjected to a-retiming in Example 5.4. (b) The results of s-retiming and a-retiming for the 2DFG in Figure 5.6(a). These results are found in Examples 5.3 and 5.4.


For fanout node 1, $w^{(s)}_{max} = 0$, $w^{(s)}_{r,max} = 2$, $w^{(a)}_{max} = 0$, and $m = 1$. The values of $W^{(s)}_r(u,v)$, $W^{(a)}(u,v)$, and $D(u,v)$ are given in Table 5.4. The a-retiming formulation is to minimize

$$\mathrm{COST}' = r^{(a)}(1)(1 - 1) + r^{(a)}(2)(0 - 1) + r^{(a)}(3)(1 - 2) + r^{(a)}(4)(2 - 1) + r^{(a)}(5)(1 - 0)$$
$$= -r^{(a)}(2) - r^{(a)}(3) + r^{(a)}(4) + r^{(a)}(5)$$

subject to

$r^{(a)}(1) - r^{(a)}(2) \le 0$
$r^{(a)}(2) - r^{(a)}(4) \le 3$
$r^{(a)}(3) - r^{(a)}(4) \le -1$
$r^{(a)}(3) - r^{(a)}(5) \le 0$
$r^{(a)}(1) - r^{(a)}(2) \le -5$
$r^{(a)}(1) - r^{(a)}(4) \le -2$
$r^{(a)}(2) - r^{(a)}(4) \le -2$
$r^{(a)}(3) - r^{(a)}(4) \le -6$.
The a-retiming solution found using the linear programming solver in GAMS [63] is $r^{(a)}(1) = -7$, $r^{(a)}(2) = -2$, $r^{(a)}(3) = -6$, $r^{(a)}(4) = 0$, and $r^{(a)}(5) = -6$. The 2DFG is drawn in Figure 5.9(b) with the results of s-retiming (from Example 5.3) and a-retiming shown.
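The a-retiming solution of Example 5.4 can be checked in the same spirit with a short Python sketch. The constraint list is transcribed from the example; the fourth causality constraint is damaged in the source and is assumed here to be $r^{(a)}(3) - r^{(a)}(5) \le 0$, which is what the auxiliary edge from node 3 to the dummy node 5 would give. The names `A_CONSTRAINTS`, `a_feasible`, and `a_cost` are hypothetical.

```python
# (u, v, bound) encodes r(u) - r(v) <= bound.
A_CONSTRAINTS = [
    (1, 2, 0), (2, 4, 3), (3, 4, -1),
    (3, 5, 0),  # assumed reading of a garbled constraint in the source
    (1, 2, -5), (1, 4, -2), (2, 4, -2), (3, 4, -6),
]

# COST' = -r(2) - r(3) + r(4) + r(5); node 1 has coefficient zero.
A_COEFF = {1: 0, 2: -1, 3: -1, 4: 1, 5: 1}


def a_feasible(r):
    return all(r[u] - r[v] <= b for u, v, b in A_CONSTRAINTS)


def a_cost(r):
    return sum(A_COEFF[v] * r[v] for v in A_COEFF)


if __name__ == "__main__":
    reported = {1: -7, 2: -2, 3: -6, 4: 0, 5: -6}
    print(a_feasible(reported), a_cost(reported))
```

Because node 1 has a zero cost coefficient, any smaller value of $r^{(a)}(1)$ that stays feasible yields the same cost, so the linear program does not determine $r^{(a)}(1)$ uniquely.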


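We can show that the a-retiming formulation accurately models the memory requirements of a fanout node when the practical restriction

$$|w^{(a)}_r(e)| < H_{max}(\mathbf{a} \cdot \mathbf{a})/2$$

is enforced. Assume that $F(\mathbf{w}_r(e))$ is used to estimate the memory required by the edge $e$.

Lemma 5.7 If $w^{(s)}_r(e_i) < w^{(s)}_r(e_j)$ then $F(\mathbf{w}_r(e_i)) < F(\mathbf{w}_r(e_j))$.

Proof:

$w^{(s)}_r(e_i) < w^{(s)}_r(e_j) \Rightarrow w^{(s)}_r(e_i) + 1 \le w^{(s)}_r(e_j)$
$\Rightarrow H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_i) + H_{max}(\mathbf{a} \cdot \mathbf{a}) \le H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_j)$
$\Rightarrow H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_i) + H_{max}(\mathbf{a} \cdot \mathbf{a})/2 \le H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_j) - H_{max}(\mathbf{a} \cdot \mathbf{a})/2. \quad (5.11)$

Using $w^{(a)}_r(e_i) < H_{max}(\mathbf{a} \cdot \mathbf{a})/2$ and $w^{(a)}_r(e_j) > -H_{max}(\mathbf{a} \cdot \mathbf{a})/2$, we can write the inequalities

$$H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_i) + w^{(a)}_r(e_i) < H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_i) + H_{max}(\mathbf{a} \cdot \mathbf{a})/2$$

and

$$H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_j) + w^{(a)}_r(e_j) > H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_j) - H_{max}(\mathbf{a} \cdot \mathbf{a})/2.$$

Combining these with the inequality in (5.11) results in

$H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_i) + w^{(a)}_r(e_i) < H_{max}(\mathbf{a} \cdot \mathbf{a}) w^{(s)}_r(e_j) + w^{(a)}_r(e_j)$
$\Rightarrow F(\mathbf{w}_r(e_i)) < F(\mathbf{w}_r(e_j))$. □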

The following explanation uses the notation introduced in Section 5.4.1. From Lemma 5.7, we know that for a node $u$ with $k$ output edges, the edge $e_j$ which satisfies $F(\mathbf{w}_r(e_j)) \ge F(\mathbf{w}_r(e_i))$, $1 \le i \le k$, must obey $w^{(s)}_r(e_j) = w^{(s)}_{r,max}$ after s-retiming. Given that $w^{(s)}_r(e_j) = w^{(s)}_{r,max}$, from the definition of $F(\cdot)$ we can see that the edge $e_j$ which satisfies $F(\mathbf{w}_r(e_j)) \ge F(\mathbf{w}_r(e_i))$, $1 \le i \le k$, also satisfies $w^{(a)}_r(e_j) = w^{(a)}_{r,max}$. To summarize, the edge $e_j$ which satisfies $F(\mathbf{w}_r(e_j)) \ge F(\mathbf{w}_r(e_i))$, $1 \le i \le k$, satisfies $w^{(s)}_r(e_j) = w^{(s)}_{r,max}$ and $w^{(a)}_r(e_j) = w^{(a)}_{r,max}$.

The goal now is to show that the cost of the fanout node, given by

$$\sum_{e \in \{e_i, \hat{e}_i\}, 1 \le i \le k} \gamma(e) w^{(a)}_r(e),$$

is equal to $w^{(a)}_{r,max}$. Let the path $u \to v_i \to \hat{u}$ in Figure 5.7(b) be denoted as $p_i$. The only auxiliary edges which affect the cost function are those with $w^{(s)}_r(\hat{e}_i) = 0$ because $\gamma(\hat{e}_i) = 0$ for any auxiliary edge with $w^{(s)}_r(\hat{e}_i) > 0$. For the auxiliary edges with $w^{(s)}_r(\hat{e}_i) = 0$, the values of $w^{(a)}_r(\hat{e}_i)$ are made as small as possible under the constraint $w^{(a)}_r(\hat{e}_i) \ge 0$. Therefore, the value of $r^{(a)}(\hat{u})$ will force $w^{(a)}_r(\hat{e}_i) = 0$ for at least one edge which satisfies $w^{(s)}_r(\hat{e}_i) = 0$. Let this edge with $w^{(s)}_r(\hat{e}_i) = 0$ and $w^{(a)}_r(\hat{e}_i) = 0$ be the edge $\hat{e}_j$. Since

$$w^{(a)}_r(\hat{e}_j) = \min_{\hat{e}_i : w^{(s)}_r(\hat{e}_i) = 0} \{ w^{(a)}_r(\hat{e}_i) \}$$

and the retimed path weights $w^{(a)}_r(p_i)$ are identical for $1 \le i \le k$ (they are all equal to $w^{(a)}_{max} + r^{(a)}(\hat{u}) - r^{(a)}(u)$) because the unretimed path weights $w^{(a)}(p_i)$ are identical (they are all equal to $w^{(a)}_{max}$), we know $w^{(a)}_r(e_j) = w^{(a)}_{r,max}$. This means that

$$w^{(a)}_r(p_j) = w^{(a)}_r(e_j) + w^{(a)}_r(\hat{e}_j) = w^{(a)}_{r,max}.$$

The total cost of the $k$ fanout edges is

$$\sum_{e \in \{e_i, \hat{e}_i\}, 1 \le i \le k} \gamma(e) w^{(a)}_r(e) = \sum_{e \in \{e_i, \hat{e}_i\}, 1 \le i \le k} \gamma(e) w^{(a)}(e) + \sum_{e \in \{e_i, \hat{e}_i\}, 1 \le i \le k} \gamma(e) \left( r^{(a)}(v) - r^{(a)}(u) \right) = w^{(a)}_{r,max},$$

as desired.


5.4.4  Combining  the  results  of  s-retiming  and  a-retiming 


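The results of s-retiming and a-retiming must be combined to get the retimed 2DFG. From $w^{(s)}_r(e) = \mathbf{w}_r(e) \cdot \mathbf{s}$ and $w^{(a)}_r(e) = \mathbf{w}_r(e) \cdot \mathbf{a}$, we can write

$$\begin{bmatrix} \mathbf{s}^T \\ \mathbf{a}^T \end{bmatrix} \mathbf{w}_r(e) = \begin{bmatrix} w^{(s)}_r(e) \\ w^{(a)}_r(e) \end{bmatrix},$$

so $\mathbf{w}_r(e)$ can be computed using

$$\mathbf{w}_r(e) = \begin{bmatrix} \mathbf{s}^T \\ \mathbf{a}^T \end{bmatrix}^{-1} \begin{bmatrix} w^{(s)}_r(e) \\ w^{(a)}_r(e) \end{bmatrix}. \quad (5.12)$$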


Example 5.5 For the retiming performed in Examples 5.3 and 5.4, the processing order was specified by $\mathbf{s} = [1\ 2]^T$ and $\mathbf{a} = [-2\ 1]^T$. Using these values in (5.12) gives

$$\mathbf{w}_r(e) = \frac{1}{5} \begin{bmatrix} 1 & -2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} w^{(s)}_r(e) \\ w^{(a)}_r(e) \end{bmatrix}.$$

Applying this to the results shown in Figure 5.9(b) gives the retimed 2DFG shown in Figure 5.10, which is the result of applying orthogonal 2-D retiming to the 2DFG in Figure 5.6(a).
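The combination step in (5.12) amounts to solving a 2 × 2 linear system, which can be sketched in Python with exact rational arithmetic so that non-integer results are easy to detect. The function names `combine` and `is_integer_vector` are hypothetical; the system is solved here by Cramer's rule, which is one of several equivalent ways to apply the inverse.

```python
from fractions import Fraction


def combine(ws, wa, s, a):
    """Solve [s^T; a^T] w = [ws; wa] for w = (wx, wy), as in (5.12)."""
    det = s[0] * a[1] - s[1] * a[0]
    if det == 0:
        raise ValueError("s and a must be linearly independent")
    wx = Fraction(ws * a[1] - s[1] * wa, det)
    wy = Fraction(s[0] * wa - a[0] * ws, det)
    return wx, wy


def is_integer_vector(w):
    """True when every component of w is an integer."""
    return all(c.denominator == 1 for c in w)


if __name__ == "__main__":
    s, a = (1, 2), (-2, 1)
    # Edge 4 -> 1 after Examples 5.3 and 5.4: projections 1 and 0 + (-7) - 0.
    w = combine(1, -7, s, a)
    print(w, is_integer_vector(w))
```

For the edge from node 4 to node 1, the projections $w^{(s)}_r(e) = 1$ and $w^{(a)}_r(e) = -7$ combine to an integer dependence vector, whereas a projection pair such as (1, -8) does not, which is the kind of incompatibility an integrality check of this form would flag.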


Figure 5.10: The result of performing orthogonal 2-D retiming on the 2DFG in Figure 5.6(a).


A problem with orthogonal 2-D retiming is that s-retiming and a-retiming may give incompatible results. To show this, we consider an alternative solution to a-retiming in Example 5.4. The solution $r^{(a)}(1) = -8$, $r^{(a)}(2) = -2$, $r^{(a)}(3) = -6$, $r^{(a)}(4) = 0$, and $r^{(a)}(5) = -6$ has the same cost and satisfies all of the a-retiming constraints; however, this new a-retiming solution is not compatible with the s-retiming solution found in Example 5.3. To see this, note that for the edge $4 \xrightarrow{e} 1$, we found $w^{(s)}_r(e) = 1$ in Example 5.3 and our new solution to a-retiming gives $w^{(a)}_r(e) = 0 + (-8) - 0 = -8$, so the dependency for this edge in the retimed 2DFG is

$$\mathbf{w}_r(e) = \frac{1}{5} \begin{bmatrix} 1 & -2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -8 \end{bmatrix} = \begin{bmatrix} 17/5 \\ -6/5 \end{bmatrix}.$$

Since this dependence vector has non-integer elements, the retimed 2DFG is not practical. The following section introduces a variation of orthogonal 2-D retiming which guarantees that the retimed dependencies have integer elements for a common set of processing orders.


5.5  Integer  Orthogonal  2-D  Retiming 


Integer orthogonal 2-D retiming can be used to guarantee that the edge dependence vectors have integer elements when the scanning vector has the form $\mathbf{s} = [1\ k]^T$ or $\mathbf{s} = [k\ 1]^T$, where $k$ is a nonnegative integer. Similar to orthogonal 2-D retiming, s-retiming and a-retiming are used in integer orthogonal retiming, but a-retiming is manipulated in integer orthogonal retiming so the dependencies are guaranteed to have integer elements. Since integer orthogonal retiming consists of solving two linear programming problems, it can be solved in polynomial time.

5.5.1  a-retiming  for  the  $s_x = 1$  Case


The first constraint for a-retiming is $r^{(a)}(u) - r^{(a)}(v) \le w^{(a)}(e)$ for all edges $u \xrightarrow{e} v$ in $E$ such that $w^{(s)}_r(e) = 0$. This can be written as

$$\left( \begin{bmatrix} r_x(u) \\ r_y(u) \end{bmatrix} - \begin{bmatrix} r_x(v) \\ r_y(v) \end{bmatrix} \right) \cdot \mathbf{a} \le w^{(a)}(e) \quad (5.13)$$

for all $e \in E$ such that $w^{(s)}_r(e) = 0$. From $r^{(s)}(u) = \mathbf{r}(u) \cdot \mathbf{s}$, we know $r_x(u) s_x + r_y(u) s_y = r^{(s)}(u)$, which implies $r_x(u) = r^{(s)}(u) - r_y(u) s_y$ because $s_x = 1$ is assumed. Substituting this expression for $r_x(u)$ into (5.13) gives

$$\left( \begin{bmatrix} r^{(s)}(u) - r_y(u) s_y \\ r_y(u) \end{bmatrix} - \begin{bmatrix} r^{(s)}(v) - r_y(v) s_y \\ r_y(v) \end{bmatrix} \right) \cdot \mathbf{a} \le w^{(a)}(e). \quad (5.14)$$

Assuming that $\mathbf{a}$ and $\mathbf{s}$ are related by $a_x = -s_y$ and $a_y = s_x = 1$, (5.14) can be written as

$$-s_y \left( r^{(s)}(u) - r_y(u) s_y - r^{(s)}(v) + r_y(v) s_y \right) + (r_y(u) - r_y(v)) \le w^{(a)}(e). \quad (5.15)$$

Since the first constraint for a-retiming applies to the edges with $w^{(s)}_r(e) = 0$, this implies $w^{(s)}(e) = r^{(s)}(u) - r^{(s)}(v)$, so we can replace $r^{(s)}(u) - r^{(s)}(v)$ with $w^{(s)}(e)$ in (5.15) to get

$$-s_y w^{(s)}(e) + (r_y(u) - r_y(v))(1 + s_y^2) \le w^{(a)}(e).$$

Expanding $w^{(s)}(e) = s_x w_x(e) + s_y w_y(e)$ and $w^{(a)}(e) = -s_y w_x(e) + s_x w_y(e)$ results in

$$-s_y (s_x w_x(e) + s_y w_y(e)) + (r_y(u) - r_y(v))(1 + s_y^2) \le -s_y w_x(e) + s_x w_y(e),$$

which can be rewritten using $s_x = 1$ as

$-s_y w_x(e) - s_y^2 w_y(e) + (r_y(u) - r_y(v))(1 + s_y^2) \le -s_y w_x(e) + w_y(e)$
$\Rightarrow (r_y(u) - r_y(v))(1 + s_y^2) \le w_y(e)(1 + s_y^2)$
$\Rightarrow r_y(u) - r_y(v) \le w_y(e)$.

Therefore, the first constraint for a-retiming when $\mathbf{s} = [1\ k]^T$ is $r_y(u) - r_y(v) \le w_y(e)$ for all $e \in E$ such that $w^{(s)}_r(e) = 0$.

The second constraint for a-retiming is $r^{(a)}(u) - r^{(a)}(v) \le W^{(a)}(u,v) - \mathbf{a} \cdot \mathbf{a}$ for all $u,v \in V$ such that $D(u,v) > c$ and $W^{(s)}_r(u,v) = 0$. Using the left-hand side of (5.15) to substitute for $r^{(a)}(u) - r^{(a)}(v)$, this can be written as

$$-s_y \left( r^{(s)}(u) - r^{(s)}(v) \right) + (r_y(u) - r_y(v))(1 + s_y^2) \le W^{(a)}(u,v) - \mathbf{a} \cdot \mathbf{a}$$

for all $u,v \in V$ such that $D(u,v) > c$ and $W^{(s)}_r(u,v) = 0$. Solving for $r_y(u) - r_y(v)$, the second constraint for a-retiming can be written as

$$r_y(u) - r_y(v) \le \frac{W^{(a)}(u,v) - \mathbf{a} \cdot \mathbf{a} + s_y \left( r^{(s)}(u) - r^{(s)}(v) \right)}{1 + s_y^2}$$

for all $u,v \in V$ such that $D(u,v) > c$ and $W^{(s)}_r(u,v) = 0$. The left-hand side of this inequality must be an integer, but the right-hand side is not guaranteed to be an integer (this occurs in Example 5.6), so we can rewrite this inequality as

$$r_y(u) - r_y(v) \le \left\lfloor \frac{W^{(a)}(u,v) - \mathbf{a} \cdot \mathbf{a} + s_y \left( r^{(s)}(u) - r^{(s)}(v) \right)}{1 + s_y^2} \right\rfloor$$

for all $u,v \in V$ such that $D(u,v) > c$ and $W^{(s)}_r(u,v) = 0$.

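The cost function for a-retiming is

$$\mathrm{COST}' = \sum_{v \in V} r^{(a)}(v) \left( \sum_{? \xrightarrow{e} v} \gamma(e) - \sum_{v \xrightarrow{e} ?} \gamma(e) \right).$$

If we let $k_v = \left( \sum_{? \xrightarrow{e} v} \gamma(e) - \sum_{v \xrightarrow{e} ?} \gamma(e) \right)$, then the cost can be written as

$$\mathrm{COST}' = \sum_{v \in V} (-s_y r_x(v) + r_y(v)) k_v$$
$$= \sum_{v \in V} \left( -s_y (r^{(s)}(v) - r_y(v) s_y) + r_y(v) \right) k_v$$
$$= \sum_{v \in V} \left( -s_y r^{(s)}(v) \right) k_v + \sum_{v \in V} r_y(v) (1 + s_y^2) k_v.$$

During a-retiming, $\sum_{v \in V} (-s_y r^{(s)}(v)) k_v$ and $(1 + s_y^2)$ are constant values, so minimizing COST' is equivalent to minimizing

$$\mathrm{COST}'' = \sum_{v \in V} r_y(v) \left( \sum_{? \xrightarrow{e} v} \gamma(e) - \sum_{v \xrightarrow{e} ?} \gamma(e) \right).$$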

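Summarizing, the a-retiming formulation for the case when $\mathbf{s} = [1\ k]^T$ is given by: Minimize

$$\mathrm{COST}'' = \sum_{v \in V} r_y(v) \left( \sum_{? \xrightarrow{e} v} \gamma(e) - \sum_{v \xrightarrow{e} ?} \gamma(e) \right)$$

subject to

1. $r_y(u) - r_y(v) \le w_y(e)$ for all $e \in E$ such that $w^{(s)}_r(e) = 0$.

2. $r_y(u) - r_y(v) \le \left\lfloor \frac{W^{(a)}(u,v) - \mathbf{a} \cdot \mathbf{a} + s_y (r^{(s)}(u) - r^{(s)}(v))}{1 + s_y^2} \right\rfloor$ for all $u,v \in V$ such that $D(u,v) > c$ and $W^{(s)}_r(u,v) = 0$.

After solving for the values of $r_y(v)$, the values of $r_x(v)$ can be computed using

$$r_x(v) = r^{(s)}(v) - r_y(v) s_y.$$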
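The two arithmetic steps specific to the $s_x = 1$ case, the floored clock period bound and the recovery of $r_x(v)$, can be sketched as small Python helpers. The function names and the sample numbers in the usage lines are hypothetical and are not taken from any example in this chapter; Python's `//` operator rounds toward negative infinity, which matches the floor in constraint 2.

```python
def ry_clock_bound(W_a, a_dot_a, s_y, rs_u, rs_v):
    """Floor of (W_a - a.a + s_y (rs_u - rs_v)) / (1 + s_y^2).

    W_a is W^(a)(u,v); rs_u and rs_v are the s-retiming values
    r^(s)(u) and r^(s)(v). All arguments are integers.
    """
    numerator = W_a - a_dot_a + s_y * (rs_u - rs_v)
    return numerator // (1 + s_y * s_y)  # floor division, also for negatives


def rx_from_ry(rs_v, ry_v, s_y):
    """Recover r_x(v) = r^(s)(v) - r_y(v) s_y when s_x = 1."""
    return rs_v - ry_v * s_y


if __name__ == "__main__":
    # Hypothetical values with s = [1 1]^T, so a.a = 2 and 1 + s_y^2 = 2.
    print(ry_clock_bound(W_a=0, a_dot_a=2, s_y=1, rs_u=1, rs_v=0))
    print(rx_from_ry(rs_v=3, ry_v=1, s_y=2))
```

Since $r_x(v)$ is computed from integers only, every retiming vector obtained this way has integer elements, which is the property the integer formulation is meant to guarantee.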

Example 5.6 In this example, we use the integer orthogonal retiming formulation for the case where $\mathbf{s} = [1\ k]^T$ to retime the 2DFG shown in Figure 5.11(a) assuming $\mathbf{s} = [1\ 1]^T$ and $\mathbf{a} = [-1\ 1]^T$. The desired clock period is 2 units of time, and addition and multiplication are assumed to take 1 and 2 units of time, respectively. The result of s-retiming is shown in Figure 5.11(b), where the numbers on the edges are the values of $w^{(s)}_r(e)$.

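Figure 5.11(c) shows the 2DFG in Figure 5.11(a) with the auxiliary edges included to properly model the fanout of node 1. Since the integer orthogonal retiming formulation uses the values of $w_y(e)$ for all $e \in E$, the values of $w_y(e)$ on the auxiliary edges in Figure 5.11(c) are computed using

$$\begin{bmatrix} w_x(e) \\ w_y(e) \end{bmatrix} = \begin{bmatrix} \mathbf{s}^T \\ \mathbf{a}^T \end{bmatrix}^{-1} \begin{bmatrix} w^{(s)}(e) \\ w^{(a)}(e) \end{bmatrix}.$$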

Then  a-retiming  consists  of  minimizing 


COST"  =  r(“)(2)  +  r(“)(3)  -  r(“)(4)  -  rW(5)  -  r(“)(6)  +  r(“)(7) 
Figure 5.11: (a) The 2DFG which is retimed in Example 5.6. (b) The result of s-retiming. (c) The 2DFG showing the dependencies on the auxiliary edges. (d) The retimed 2DFG which achieves the desired clock period of 2 time units.


subject  to  the  causality  constraints 

<  1 

rv(l)-rj,(5)  <  0 

^y(l)-^y(6)  <  1 

^y(3)-M2)  <  0 
ry{4)-ry{2)  <  0 
ry{^)-ry{7)  <  0 

^y(3)~ry(3)  <  0 
<  1 

^y(6)-ry(7)  <  0 


Table 5.5: The values of $W^{(s)}_r(u,v)$, $W^{(a)}(u,v)$, and $D(u,v)$ for Example 5.6.

$W^{(s)}_r(u,v)$
 u\v    1    2    3    4    5    6    7
  1     0    0    0    0    0    0    0
  2     1    0    1    1    1    1    1
  3     1    0    0    1    1    1    1
  4     1    0    1    0    1    1    0
  5     1    0    0    1    0    1    0
  6     2    1    1    2    2    0    0
  7     -    -    -    -    -    -    0

$W^{(a)}(u,v)$
 u\v    1    2    3    4    5    6    7
  1     0    1   -1    1   -1    0    1
  2     0    0   -1    1   -1    0    1
  3     0    0    0    1   -1    0    1
  4     0    0   -1    0   -1    0    0
  5     0    0    0    1    0    0    2
  6     0    0    0    1   -1    0    1
  7     -    -    -    -    -    -    0

$D(u,v)$
 u\v    1    2    3    4    5    6    7
  1     1    4    4    3    3    3    3
  2     2    1    5    4    4    4    4
  3     3    2    1    5    5    5    5
  4     4    3    7    2    6    6    2
  5     5    4    3    7    2    7    2
  6     5    4    3    7    7    2    2
  7     -    -    -    -    -    -    0


and the clock period constraints (which use the information in Table 5.5)

$r_y(1) - r_y(2) \le 0$
$r_y(1) - r_y(3) \le -1$
$r_y(1) - r_y(4) \le 0$
$r_y(1) - r_y(5) \le -1$
$r_y(1) - r_y(6) \le 0$
$r_y(1) - r_y(7) \le -1$
$r_y(4) - r_y(2) \le -1$
$r_y(5) - r_y(2) \le -1$
$r_y(5) - r_y(3) \le -1$.

The  retimed  2DFG  is  shown  in  Figure  5.11(d). 
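
The bookkeeping in Example 5.6 can be cross-checked with a short script. The sketch below rests on two assumptions that are not stated in this form in the text. First, the first matrix of Table 5.5 is read as the s-retimed path weight, and a clock period constraint is generated for every pair (u, v) whose entry in that matrix is 0 and whose D(u,v) exceeds the clock period c = 2. Second, the edge list `EDGES` is hypothetical: it is inferred from the constraints and from D(u,v), not copied from Figure 5.11, with breadth 1/3 on the three fanout edges of node 1 and on the three auxiliary edges into node 7. All function and variable names are illustrative.

```python
from fractions import Fraction

# Rows of Table 5.5 for nodes 1..6 (row 7 lists only a single entry).
# Assumption: W_S is the s-retimed path weight between u and v.
W_S = {
    1: [0, 0, 0, 0, 0, 0, 0],
    2: [1, 0, 1, 1, 1, 1, 1],
    3: [1, 0, 0, 1, 1, 1, 1],
    4: [1, 0, 1, 0, 1, 1, 0],
    5: [1, 0, 0, 1, 0, 1, 0],
    6: [2, 1, 1, 2, 2, 0, 0],
}
D = {
    1: [1, 4, 4, 3, 3, 3, 3],
    2: [2, 1, 5, 4, 4, 4, 4],
    3: [3, 2, 1, 5, 5, 5, 5],
    4: [4, 3, 7, 2, 6, 6, 2],
    5: [5, 4, 3, 7, 2, 7, 2],
    6: [5, 4, 3, 7, 7, 2, 2],
}
CLOCK_PERIOD = 2


def clock_period_pairs(w_s, d, c):
    """Pairs (u, v) with D(u,v) > c and zero weight in the s direction."""
    return [
        (u, v)
        for u in sorted(w_s)
        for v in range(1, 8)
        if d[u][v - 1] > c and w_s[u][v - 1] == 0
    ]


# Hypothetical edge list (u, v, breadth), inferred rather than read
# from Figure 5.11. Node 7 is taken to be the zero-delay fanout node.
THIRD = Fraction(1, 3)
EDGES = [
    (1, 4, THIRD), (1, 5, THIRD), (1, 6, THIRD),   # fanout of node 1
    (2, 1, 1), (3, 2, 1), (4, 2, 1), (5, 3, 1), (6, 3, 1),
    (4, 7, THIRD), (5, 7, THIRD), (6, 7, THIRD),   # auxiliary edges
]


def cost_coefficients(edges):
    """Coefficient of r(v): breadth entering v minus breadth leaving v."""
    coeff = {v: Fraction(0) for v in range(1, 8)}
    for u, v, breadth in edges:
        coeff[v] += breadth
        coeff[u] -= breadth
    return coeff


if __name__ == "__main__":
    print(clock_period_pairs(W_S, D, CLOCK_PERIOD))
    print({v: int(k) for v, k in cost_coefficients(EDGES).items()})
```

Under these assumptions the selected pairs coincide with the nine clock period constraints listed above, and the coefficients reproduce the signs of the cost function of the example (+1 for nodes 2, 3, and 7; -1 for nodes 4, 5, and 6; 0 for node 1).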


5.5.2  a-retiming for the $s_y = 1$ Case


Using the same techniques as those used in Section 5.5.1 to manipulate a-retiming, we
can find that a-retiming has the following formulation when $s = [k\ 1]^T$.

Minimize

\[
\mathrm{COST}'' = \sum_{v \in V} \bigl(-r_x(v)\bigr)
  \Bigl[ \sum_{? \xrightarrow{e} v} \gamma(e) - \sum_{v \xrightarrow{e} ?} \gamma(e) \Bigr]
\]

subject to

1. $r_x(u) - r_x(v) \ge w_x(e)$ for all $e \in E$ such that $w_r^{(s)}(e) = 0$.

2. $r_x(u) - r_x(v) \ge \cdots$ for all $u, v \in V$ such that $D(u,v) > c$ and
   $W_r^{(s)}(u,v) = 0$.

After solving for the values of $r_x(v)$, the values of $r_y(v)$ can be computed using
$r_y(v) = r^{(s)}(v) - r_x(v) s_x$.
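
The recovery of $r_y(v)$ can be checked with a one-line derivation. It assumes, as the notation suggests but as this section does not restate, that $r^{(s)}(v)$ denotes the projection of the retiming vector $r(v) = [r_x(v)\ r_y(v)]^T$ onto the scanning vector $s = [s_x\ s_y]^T$:

\[
r^{(s)}(v) = s^T r(v) = s_x r_x(v) + s_y r_y(v)
\quad\Longrightarrow\quad
r_y(v) = r^{(s)}(v) - s_x r_x(v) \quad \text{when } s_y = 1 .
\]

For $s = [k\ 1]^T$ this reads $r_y(v) = r^{(s)}(v) - k\, r_x(v)$, so integer values of $r^{(s)}(v)$ and $r_x(v)$ always give an integer $r_y(v)$.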

5.6  Comparisons 


In  this  section  we  compare  the  results  of  using  our  ILP  2-D  retiming  technique  and 
our  orthogonal  2-D  retiming  technique  with  the  previously  published  chained  [34]  and 
schedule-based  [33]  2-D  retiming  approaches. 

Comparisons for the 2DFGs in Figure 5.6(a) and Figure 5.13(a) are given in Table 5.6
and Table 5.7, respectively. The results in these tables assume that the computation time
of each node is one time unit, the desired clock period is one time unit, and the 2DFG
operates on a 256 x 256 data set. Because the number of registers required by the retimed
2DFG is not the same for each of the $256^2$ iterations, the number of registers required by
the retimed 2DFGs is determined by computing the memory required for each of the $256^2$
iterations and taking the maximum of these values. To demonstrate this, the memory
requirement for the 2DFG in Figure 5.12(a) is computed assuming a 4 x 4 data set and the
processing order specified by $s = [1\ 1]^T$ and $a = [-1\ 1]^T$. At the beginning
of iteration $[1\ 2]^T$, the four samples which must be stored due to the dependency
$[1\ 0]^T$ are indicated in Figure 5.12(b) with an “x” and the one sample which must
be stored due to the dependency $[-1\ 1]^T$ is indicated with an “o”. Therefore,
iteration $[1\ 2]^T$ requires 5 samples to be stored. The reader can verify that
iteration $[1\ 1]^T$ requires 4 samples to be stored and that iteration $[2\ 2]^T$ requires
5 samples to be stored. The maximum number of samples that must be stored for
any iteration is 5, so this 2DFG requires 5 registers.


Figure 5.12: (a) A 2DFG. (b) The samples which must be stored.

Because the 2DFG in Figure 5.6(a) is small, the ILP 2-D retiming technique described
in Section 5.3 was used to obtain the results in Table 5.6. Note that the minimum length
scanning vector feasible for this DFG with schedule-based retiming is $s = [1\ 4]^T$. Due
to the relatively large size of the 2DFG in Figure 5.13(a), the orthogonal 2-D retiming
technique in Section 5.4 was used to obtain the results in Table 5.7. Since orthogonal 2-D
retiming resulted in dependence vectors with integer elements, it was not necessary to use
integer orthogonal retiming for this 2DFG. Figure 5.6(b) shows the retimed version of the
2DFG in Figure 5.6(a) for $s = [1\ 2]^T$ and $a = [-2\ 1]^T$, and Figure 5.13(b) shows
the retimed version of the 2DFG in Figure 5.13(a) for $s = [1\ 1]^T$ and $a = [1\ 1]^T$.

From Tables 5.6 and 5.7, we can observe that the “schedule-based” retiming technique
in [33] does not find a solution for any of the processing orders chosen. This is because
our techniques have less stringent (but still sufficient) causality constraints than the
schedule-based technique. Our techniques are therefore compatible with more processing
orders, and in this sense they offer more flexibility than the schedule-based retiming
technique.

Table 5.6: Memory requirements after retiming the circuit in Figure 5.6(a) assuming a
256 x 256 data set.

scanning vector     retiming technique     number of registers
$s = [0\ 1]^T$      ours                   258
                    chained                510
                    schedule-based         no solution
$s = [1\ 2]^T$      ours                   385
                    chained                511
                    schedule-based         no solution

Table 5.7: Memory requirements after retiming the circuit in Figure 5.13(a) assuming a
256 x 256 data set.

scanning vector     retiming technique     number of registers
$s = [1\ 2]^T$      ours                   778
                    chained                1794
                    schedule-based         no solution
$s = [1\ 1]^T$      ours                   1032
                    chained                2048
                    schedule-based         no solution
$s = [2\ 1]^T$      ours                   780
                    chained                1288
                    schedule-based         no solution

We  can  also  conclude  from  Tables  5.6  and  5.7  that  our  techniques  result  in  solutions 
which  require  considerably  less  memory  than  the  chained  retiming  technique  in  [34]. 
This  is  because  our  formulations  are  not  sensitive  to  the  memory  requirements  of  the 
unretimed  2DFG,  while  the  results  of  chained  retiming  are  dependent  on  the  memory 
requirements  of  the  unretimed  2DFG. 
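
The size of the memory saving can be read off the two tables with a few lines of arithmetic. The sketch below simply divides the register count of our techniques by that of chained retiming for each scanning vector; the dictionary layout and the function name are illustrative choices, and only the register counts themselves come from Tables 5.6 and 5.7.

```python
# Register counts copied from Tables 5.6 and 5.7 (256 x 256 data set).
REGISTERS = {
    ("Table 5.6", "[0 1]^T"): {"ours": 258, "chained": 510},
    ("Table 5.6", "[1 2]^T"): {"ours": 385, "chained": 511},
    ("Table 5.7", "[1 2]^T"): {"ours": 778, "chained": 1794},
    ("Table 5.7", "[1 1]^T"): {"ours": 1032, "chained": 2048},
    ("Table 5.7", "[2 1]^T"): {"ours": 780, "chained": 1288},
}


def memory_ratios(registers):
    """Ratio of registers needed by our techniques to chained retiming."""
    return {
        key: counts["ours"] / counts["chained"]
        for key, counts in registers.items()
    }


if __name__ == "__main__":
    for (table, s), ratio in memory_ratios(REGISTERS).items():
        print(f"{table}, s = {s}: {ratio:.3f}")
```

Every ratio is below 1, and the smallest one, for $s = [1\ 2]^T$ in Table 5.7, is roughly 0.43.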

5.7  Conclusions 

In this chapter we have presented two techniques for retiming 2DFGs. These two techniques
attempt to minimize the amount of memory required to implement the 2DFGs
under a given clock period constraint. The ILP 2-D retiming technique solves the entire
2-D retiming problem as a whole but requires long run times to solve. As a result, this
technique should be used only for small 2DFGs. Orthogonal 2-D retiming runs faster
than the ILP technique but occasionally gives incompatible results between s-retiming
and a-retiming. Therefore, orthogonal 2-D retiming should be used when the 2DFG is
too large to solve using ILP 2-D retiming, and integer orthogonal 2-D retiming should be
used when orthogonal 2-D retiming gives incompatible results between s-retiming and
a-retiming.

Our comparisons have shown that the techniques presented in this chapter give considerably
better results than previously published techniques. In fact, our techniques can
result in retimed 2DFGs which require less than 50% of the memory hardware required
by the technique in [34]. Our techniques perform better than the technique in [33] because
our formulations have less stringent (but still sufficient) causality constraints, and
they perform better than chained retiming in [34] because our formulations are not sensitive
to the memory requirements of the unretimed 2DFG, while the results of chained
retiming are dependent on the memory requirements of the unretimed 2DFG.

Future research should be directed toward studying the interactions between inter-iteration
parallelism and inter-operation parallelism and toward finding algorithms for
retiming data-flow graphs which operate on signals which have dimensionality greater
than two for applications such as video processing. Register minimization in 2-D retiming
which includes the use of scanning order conversion requires further study. Retiming for
folding for the one-dimensional case has been studied in [28]. Two-dimensional retiming
for folding of 2DFGs is another topic of further research.

Figure 5.13: (a) A 2-D IIR filter. (b) A retimed version of the filter.

Chapter 6


Conclusions  and  Future  Research 
Directions 

6.1  Conclusions 

We have considered several formal techniques for mapping DSP algorithms to VLSI
architectures. The salient features of these techniques are that they increase the
understanding of the interaction between algorithms and architectures, and they provide
methods for designing new and improved architectures for a wide variety of DSP
algorithms.

A new formulation of scheduling was presented in Chapter 2. Using this formulation,
we showed that retiming is a special case of scheduling, and we described the
interaction between retiming and scheduling in a mathematical framework. Algorithms
were developed for exhaustively generating all retiming and scheduling solutions for a
strongly connected DFG. By carefully choosing the examples in this chapter, we have
given scheduling solutions for many filters which are of interest to the high-level synthesis
community. This community should find the scheduling results for the biquad filter and
the fifth order wave digital elliptic filter to be of particular interest.

New expressions were introduced in Chapter 3 for computing the minimum number
of registers required to implement a statically scheduled DFG. Two cases are considered,
namely, the cases where retiming is and is not allowed after the DFG has been
scheduled. These results should be useful in CAD tools used to design memory-efficient
architectures.

The multirate folding transformation was developed in Chapter 4. Within the scope
of multirate folding, the problems of retiming for multirate folding and register minimization
in (multirate) folded architectures were also considered. Together, the formulations
of multirate folding, retiming for multirate folding, and register minimization provide a
new technique for designing single-rate VLSI architectures for multirate DSP algorithms,
such as the discrete wavelet transform.

In  Chapter  5,  two  techniques  for  2-D  retiming  were  presented,  namely,  ILP  2-D 
retiming  and  orthogonal  2-D  retiming.  These  techniques  can  reduce  the  memory  usage 
in  2-D  DSP  implementations  by  over  50%.  This  is  of  particular  importance  due  to 
the  recent  high  demand  for  low  cost  and  low  power  implementations  of  2-D  DSP  for 
multimedia  applications. 

6.2  Future  Research  Directions 

The work presented in this thesis provides the foundation for several interesting future
research projects. In the area of exhaustive scheduling and retiming, it would be
interesting to include unfolding [62] in the formulation. Since a formulation is given in
Chapter 2 for folding, it seems natural that a similar formulation can be derived for
unfolding, since unfolding is essentially the inverse operation of folding. A formulation
which includes retiming, folding, and unfolding would be interesting from a theoretical
point of view as well as a practical point of view.


In the area of register minimization, we have solved the problem of computing the
number of registers required by a scheduled DSP algorithm, but the problem of allocating
data to these registers is an open problem. Although several excellent heuristic
techniques have been suggested (e.g., in [51], [52], and [53]), the topic of memory management
will be an open problem for many years due to the large percentage of chip area
which must be dedicated to memory.

In  the  area  of  multirate  synthesis,  the  topics  of  retiming  [35]  and  scheduling  [55] 
for  multirate  DFGs  are  still  under  examination.  The  study  of  these  topics  and  the 
development  of  formulations  for  retiming  and  scheduling  similar  to  those  in  Chapter  2 
(but  for  the  multirate  case)  would  be  both  useful  and  interesting. 

In  the  area  of  multi-dimensional  retiming,  2-D  retiming  with  non-linear  scanning 
orders,  such  as  the  Dovetail  scan  [72],  would  be  an  interesting  extension.  Future  research 
should  also  take  into  account  the  cost  of  scan  conversion  buffers,  i.e.,  the  buffers  required 
to  convert  the  data  to  and  from  the  traditional  line-by-line  scanning  order.  Another  area 
of  future  research  is  to  extend  the  2-D  retiming  formulations  to  higher  dimensions.  This 
problem,  which  is  by  no  means  trivial,  has  applications  in  the  very  popular  area  of  digital 
video  processing. 

Finally, one research topic, which we have not been able to address, includes most of
the topics covered in this thesis. This topic is to combine 2-D retiming, multirate folding,
and register minimization to develop a multirate/multi-dimensional folding transformation.
Such a transformation would be useful for designing new two-dimensional discrete
wavelet transform architectures [73] [74].